CHAPTER 1
INTRODUCTION 

Pattern recognition techniques can be classified into three groups: syntactic, neural,
and statistical [22:2]. Syntactic pattern recognition techniques analyze the structure of
patterns. This approach is used in such areas as model-based vision and artificial
intelligence [22:19]. Artificial Neural Networks (ANNs) are mathematical models of
biological neurons. ANNs have been employed in pattern recognition tasks such as optical
character recognition [11:27]. Statistical pattern recognition covers a wide variety of
algorithms whose fundamental building blocks are statistics and probability.

This  thesis  mainly  discusses  statistical  techniques. 

Statistical pattern recognition can be subdivided into two groups called parametric
and nonparametric. Parametric techniques assume an underlying probability density
function [5:85], [22:66]. Probability models are generally described by parameters;
therefore, techniques that make use of probability models are called parametric.
Nonparametric techniques, on the other hand, do not assume a probability model. Like
parametric techniques, however, they may also require parameters. Nonparametric
techniques can ignore probability models altogether, or in some cases they can be used to
estimate a probability model.

The Bayesian parametric technique is the best possible approach since its
performance is optimal [5:17], [18:45]. However, it requires a priori knowledge of the
probability model, which is rarely if ever known in practical applications [5:44].
Nonparametric techniques, on the other hand, do not require probability models. An
advantage of some nonparametric techniques is that their density estimates asymptotically
approach the true density, so Bayes-optimal performance can be approached. However,
nonparametric techniques often require large sample sets, which, in turn, require
significant storage resources [5:122]. Reducing storage requirements without adversely
affecting classification performance is a classic problem in statistical pattern
recognition, and many approaches have been proposed in the literature. P. E. Hart proposed
a condensed nearest-neighbor rule in 1968 [12]. In 1972, D. L. Wilson analyzed a
nearest-neighbor rule that uses an edited data set [25]. K. Fukunaga et al. addressed the
storage problem with a nonparametric data reduction technique in 1984 [7] and proposed a
reduced Parzen classifier in 1989 [8]. More recently, Radial Basis Function (RBF) neural
network training paradigms have been proposed to reduce the number of training samples
[3], [16], [17]. In this thesis, a novel approach is presented which allows direct control
over storage reduction via a single system parameter.

This  thesis  presents  a  pattern  recognition  system  called  Weighted  Parzen  Windows 
(WPW).  Statistical  pattern  recognition  techniques  and  concepts  are  central  to  the 
development  and  analysis  of  the  WPW  approach.  Chapter  2  reviews  statistical  pattern 
recognition.  Chapter  3  presents  the  WPW  algorithm  and  Chapter  4  presents  analytical 
results  of  special  case  training  scenarios.  Chapter  5  presents  experimental  results,  and 
Chapter 6 presents a summary and conclusions.

CHAPTER 2

STATISTICAL  PATTERN  RECOGNITION 

In this chapter, several pattern recognition concepts and techniques are reviewed.
They are fundamental to the formulation and subsequent analysis of the
weighted-Parzen-window technique. Key pattern recognition concepts such as training,
classification, and discriminant functions are discussed in the following sections. Also,
traditional parametric and nonparametric pattern recognition techniques are presented.
They are the Bayes, minimum distance, Parzen window, and k-nearest-neighbor classifiers.

Training 

A typical pattern recognition system uses sensors to measure a state of nature.
Once a measurement is made, a feature extractor is used to extract features [5:2]. A
feature can be any quantity which provides a meaningful description of the state of nature.
Throughout this thesis, the data are assumed to be feature data. Feature extraction is
discussed in references [5], [9], and [15]. Features are organized into multidimensional
feature vectors, denoted $\mathbf{x}$. Feature vectors are also called sample vectors.
Once the feature data are acquired, training can begin. Training is a fundamental concept
in pattern recognition. Often referred to as learning, it is the process of extracting
information from a training set of feature data. Training techniques range from extremely
complicated to simple. For example, some neural paradigms may require billions of computer
operations, sometimes taking hours or even days to extract information. On the other hand,
some statistical techniques may only require learning the mean of the training set. The
simplest form of learning is employed by techniques whose training phase consists of
storing the training set. This can be likened to rote memorization. Although there are
many training algorithms, they all perform the same task: they extract information from
the training set. This information is used to assign class membership to unlabeled
samples, which is called classification.

Classification  by  Discriminant  Analysis 

Classification is often accomplished by discriminant analysis. In discriminant
analysis, a discriminant function is used for each class of data; the classes are labeled
$\omega_i$, where $i = 1, \ldots, c$. Discriminant functions are scalar-valued functions of
a vector, denoted $g_i(\mathbf{x})$, where $i = 1, \ldots, c$. Classification is
traditionally accomplished by assigning to an unlabeled feature vector the class label of
the largest-valued discriminant function [5:17], [18:6]. This type of analysis leads to
decision boundaries where $g_i(\mathbf{x}) = g_j(\mathbf{x})$ for $i \neq j$. In the case
that an unlabeled sample falls on a decision boundary, it is usually assigned class
membership by some convenient and arbitrary rule. Discriminant analysis is implemented
by a pattern classifier or simply a classifier.

Properties  of  Discriminant  Functions 

Since a pattern classifier compares all discriminant function outputs to find the
maximum, only their relative values are important. Therefore, equivalent changes can be
made to each discriminant function without affecting the classification results. In other
words, the decision boundaries are not changed. Some mathematical operations that do
not change classification results are multiplication or division by a positive constant,
addition or subtraction of a bias, and replacement of $g_i(\mathbf{x})$ by
$f(g_i(\mathbf{x}))$, where $f$ is a monotonically increasing function [5:17]. As long as
each discriminant function is changed in the same way, the classification results will not
be changed.
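
As a small illustration of this property, the sketch below applies the three kinds of change to a table of discriminant values and compares the resulting decisions. The numerical values are hypothetical and chosen only for the example; they do not come from the thesis.

```python
import numpy as np

# Hypothetical discriminant outputs: one row per sample, one column per class.
g = np.array([[0.20, 0.50, 0.30],
              [0.70, 0.10, 0.20],
              [0.05, 0.15, 0.80]])

# Decision rule: pick the class with the largest discriminant value.
decisions = np.argmax(g, axis=1)

# The same change applied to every discriminant function.
scaled = np.argmax(3.0 * g, axis=1)     # multiplication by a positive constant
shifted = np.argmax(g - 10.0, axis=1)   # addition of a (negative) bias
logged = np.argmax(np.log(g), axis=1)   # monotonically increasing f = ln
```

All four label vectors agree, which is why the later sections are free to drop common terms and take logarithms.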

The  Bayes  Decision  Strategy 


The  Bayes  decision  strategy  yields  a  classifier  that  is  optimal,  i.e.,  the  classification 
error  rate  is  minimal.  This  concept  is  paramount  to  pattern  recognition  regardless  of  the 
technique  used.  Therefore,  a  detailed  discussion  will  follow.  First,  Bayes  rule  will  be 
presented.  Then,  the  concept  of  conditional  risk  will  be  used  to  derive  the  optimal 
classifier.  Finally,  the  Gaussian  multivariate  discriminant  function  will  be  presented.  The 
formulation  of  this  section  follows  that  of  Duda  and  Hart  [5]. 

Bayes Rule and Conditional Risk. Let $\mathbf{x}$ be a $d$-component feature vector which
obeys the class-conditional probability density function $p(\mathbf{x} \mid \omega_i)$,
where $\omega_i$ represents one of $c$ possible states of nature that are of interest, and
$P(\omega_i)$ represents the a priori probability that $\omega_i$ occurs. The a posteriori
probability of a state of nature can be expressed by Bayes rule:

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})} \tag{2.1}$$

where

$$p(\mathbf{x}) = \sum_{i=1}^{c} p(\mathbf{x} \mid \omega_i)\, P(\omega_i). \tag{2.2}$$

Now that the a posteriori probability function has been found by Bayes rule, the
next step is to define a suitable loss function. Given that the action $\alpha_i$ is taken
and the true state of nature is $\omega_j$, the conditional loss is given by
$\lambda(\alpha_i \mid \omega_j)$. The conditional risk, or expected loss, of taking the
action $\alpha_i$ is given by

$$\rho(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x}). \tag{2.3}$$

The optimal decision rule is one that achieves the minimal overall risk. Such a
decision rule can be achieved by taking, for every $\mathbf{x}$, the action that minimizes
the conditional risk [5:17], [18:45], [15:188-190] as given by Equation (2.3). A commonly
used loss function is the symmetrical or zero-one loss function [5:16]. This function
assigns a unit loss when the action $\alpha_i$ is taken and the actual state of nature is
$\omega_j$, if and only if $i \neq j$. Stated mathematically,

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \qquad i, j = 1, \ldots, c. \tag{2.4}$$

Using Equation (2.4), the optimal decision rule can be derived. The conditional risk, given
by Equation (2.3), can be simplified by substituting Equation (2.4) for the loss function.
Since the sum of the conditional probability mass functions $P(\omega_j \mid \mathbf{x})$
taken over all $j$ must equal 1, the conditional risk can be simplified as shown:

$$\rho(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x}) = \sum_{\substack{j=1 \\ j \neq i}}^{c} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x}). \tag{2.5}$$

With this result it is clear that the minimal error rate can be achieved by selecting the
class $i$ that minimizes Equation (2.5), i.e., the class with the largest a posteriori
conditional probability.
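
The chain from Bayes rule to the minimum-risk decision can be checked numerically. The following is a minimal sketch, assuming three classes with made-up likelihood values and priors; the function names `posteriors` and `conditional_risk` are introduced here for illustration only.

```python
import numpy as np

def posteriors(likelihoods, priors):
    """Bayes rule, Equations (2.1)-(2.2): P(w_i|x) = p(x|w_i) P(w_i) / p(x)."""
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    return joint / joint.sum()

def conditional_risk(post, loss):
    """Equation (2.3): risk[i] = sum_j loss[i, j] * P(w_j|x)."""
    return loss @ post

c = 3
zero_one = 1.0 - np.eye(c)  # Equation (2.4): 0 on the diagonal, 1 elsewhere

# Hypothetical values of p(x|w_i) at one point x, and hypothetical priors.
post = posteriors(likelihoods=[0.30, 0.05, 0.10], priors=[0.2, 0.5, 0.3])
risk = conditional_risk(post, zero_one)

best_by_risk = int(np.argmin(risk))
best_by_posterior = int(np.argmax(post))
```

Under the zero-one loss the risk vector equals one minus the posterior vector, so the two selections coincide, as Equation (2.5) states.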

Discriminant Analysis with Bayes Rule. Equation (2.5) is especially suited for
discriminant analysis. Since the error rate is minimal when choosing the class with the
largest a posteriori probability, one needs only to calculate
$P(\omega_i \mid \mathbf{x})$ for each class $i$ and label the test sample according to the
largest value. In other words, the optimal decision rule with the smallest possible error
[5:17] is given as:

Decide that $\mathbf{x}$ belongs to class $\omega_i$ iff $P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x})$ for all $j \neq i$. (2.6)

The decision rule is also given by:

Decide that $\mathbf{x}$ belongs to class $\omega_i$ iff $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq i$ (2.7)

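where, by Bayes rule,

$$g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})},$$

which can be simplified by removing the scaling constant $p(\mathbf{x})$:

$$\tilde{g}_i(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)\, P(\omega_i). \tag{2.8}$$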

The  tilde  notation  of  Equation  (2.8)  indicates  that  the  discriminant  function  has  been 
changed,  but  the  classification  results  remain  the  same. 

The  Bayes-Gaussian  Discriminant  Function.  In  what  follows,  the  Bayes- 
Gaussian  decision  rule  is  derived.  It  will  be  shown  that  the  resulting  discriminant  function 
is  quadratic  in  the  most  general  case  and  linear  when  certain  assumptions  are  made. 

The discriminant function is derived by first eliminating the scaling factor
$p(\mathbf{x})$, as in Equation (2.8), since it is common to all classes [5:17-18]. The
discriminant function is given as

$$g_i(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)\, P(\omega_i). \tag{2.9}$$

The Gaussian multivariate conditional density for the $i$th class in $d$-dimensional space
is given by

$$p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)' \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right] \tag{2.10}$$

where $\boldsymbol{\mu}_i$ and $\Sigma_i$ are the mean vector and covariance matrix,
respectively, and $|\cdot|$ is the determinant. Substituting Equation (2.10) into
Equation (2.9), the discriminant function becomes

$$g_i(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)' \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right] P(\omega_i). \tag{2.11}$$

By taking the natural log of Equation (2.11), the quadratic discriminant function is
obtained. Equation (2.12) shows the general case quadratic discriminant function, which
gives the same classification results as the exponential discriminant function.

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)' \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i) \tag{2.12}$$
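
The relation between the exponential form (2.11) and the quadratic form (2.12) can be sketched in code. The class means, covariance matrices, priors, and test point below are assumptions made for the example, and the two function names are hypothetical.

```python
import numpy as np

def exponential_discriminant(x, mean, cov, prior):
    """Equation (2.11): Gaussian density times the a priori probability."""
    d = mean.shape[0]
    diff = x - mean
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm * prior

def quadratic_discriminant(x, mean, cov, prior):
    """Equation (2.12): natural log of Equation (2.11)."""
    d = mean.shape[0]
    diff = x - mean
    maha = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)  # log of the determinant
    return -0.5 * maha - 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet + np.log(prior)

# Hypothetical two-class problem in two dimensions.
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.array([[1.0, 0.2], [0.2, 0.5]]), np.array([[0.8, -0.1], [-0.1, 1.5]])]
priors = [0.6, 0.4]
x = np.array([1.2, 0.4])

g_exp = [exponential_discriminant(x, m, S, P) for m, S, P in zip(means, covs, priors)]
g_quad = [quadratic_discriminant(x, m, S, P) for m, S, P in zip(means, covs, priors)]
```

Because the logarithm is monotonically increasing, `np.argmax(g_exp)` and `np.argmax(g_quad)` select the same class.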

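The quadratic discriminant function can be further reduced to a linear discriminant
function when the covariance matrices are equivalent for each class ($\Sigma_i = \Sigma$).
In what follows, the linear discriminant function is derived. First, Equation (2.12) is
simplified by removing bias terms that are present in each discriminant function. The
simplified version is given by Equation (2.13):

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i). \tag{2.13}$$

Then, the quadratic term is expanded.

$$g_i(\mathbf{x}) = -\frac{1}{2} \left(\mathbf{x}' \Sigma^{-1} \mathbf{x} - 2 \boldsymbol{\mu}_i' \Sigma^{-1} \mathbf{x} + \boldsymbol{\mu}_i' \Sigma^{-1} \boldsymbol{\mu}_i\right) + \ln P(\omega_i). \tag{2.14}$$

Inspection of Equation (2.14) reveals that the quadratic term in $\mathbf{x}$ can be removed
since it is common to each discriminant function. The linear discriminant function is given
by

$$g_i(\mathbf{x}) = (\Sigma^{-1} \boldsymbol{\mu}_i)' \mathbf{x} - \frac{1}{2} \boldsymbol{\mu}_i' \Sigma^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i). \tag{2.15}$$

Another linear discriminant function can be derived for the case when
$\Sigma_i = \sigma^2 I$, where $I$ is the $d \times d$ identity matrix. In this case, the
discriminant function is given by

$$g_i(\mathbf{x}) = \frac{1}{\sigma^2} \boldsymbol{\mu}_i' \mathbf{x} - \frac{1}{2\sigma^2} \boldsymbol{\mu}_i' \boldsymbol{\mu}_i + \ln P(\omega_i). \tag{2.16}$$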
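
To make the reduction from (2.13) to (2.15) concrete, the sketch below evaluates both forms on random points for a shared covariance matrix. The covariance, means, and priors are illustrative assumptions, as are the function names.

```python
import numpy as np

def shared_quadratic(x, mean, cov_inv, prior):
    """Equation (2.13): quadratic form with a covariance matrix common to all classes."""
    diff = x - mean
    return -0.5 * diff @ cov_inv @ diff + np.log(prior)

def linear_discriminant(x, mean, cov_inv, prior):
    """Equation (2.15): (Sigma^-1 mu_i)' x - 0.5 mu_i' Sigma^-1 mu_i + ln P(w_i)."""
    w = cov_inv @ mean
    return w @ x - 0.5 * mean @ w + np.log(prior)

# Hypothetical three-class problem with one shared covariance matrix.
cov_inv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 2.0]]))
means = [np.array([0.0, 0.0]), np.array([1.5, -1.0]), np.array([-1.0, 2.0])]
priors = [0.5, 0.3, 0.2]

rng = np.random.default_rng(0)
X = 2.0 * rng.normal(size=(200, 2))

quad = np.array([[shared_quadratic(x, m, cov_inv, P) for m, P in zip(means, priors)] for x in X])
lin = np.array([[linear_discriminant(x, m, cov_inv, P) for m, P in zip(means, priors)] for x in X])

# The two forms differ only by 0.5 x' Sigma^-1 x, which is the same for every class.
offset = lin - quad
```

Each row of `offset` holds one value repeated across the classes, so the class rankings, and therefore the hyperplane decision boundaries, are identical for the two forms.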

This section has shown that the Bayes decision strategy is used to find an optimal
classifier. Furthermore, three Bayesian discriminant functions were derived for a Gaussian
distribution. The first of these, the arbitrary case, was shown to be quadratic (Equation
(2.12)). The decision boundaries for the quadratic case are hyperquadrics [5:30]. The
other two discriminant functions were derived for the case when covariance matrices are
identical for each class. The decision boundaries for each of these cases are hyperplanes
[5:26-30].

Minimum  Distance  Classification 

Minimum  distance  classifiers  are  widely  referenced  throughout  the  literature  [5], 
[18],  [20],  [22].  Quite  often  the  mean  or  sample  mean  of  a  class  is  used  as  a  prototype. 
With  this  type  of  classifier,  unknown  feature  vectors  are  assigned  the  class  membership  of 
the nearest mean. Two metrics are commonly used: Mahalanobis and Euclidean. The
following reviews both of these classifiers.


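Minimum Mahalanobis-Distance. The squared Mahalanobis distance of a
feature vector $\mathbf{x}$ from the $i$th class mean is given by

$$d_M^2(\mathbf{x}, \boldsymbol{\mu}_i) = (\mathbf{x} - \boldsymbol{\mu}_i)' \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \tag{2.17}$$

where $\boldsymbol{\mu}_i$ and $\Sigma_i$ are the mean vector and covariance matrix,
respectively. Since class membership is assigned based on the smallest distance given by
Equation (2.17), a discriminant function can be written as

$$g_i(\mathbf{x}) = -(\mathbf{x} - \boldsymbol{\mu}_i)' \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i). \tag{2.18}$$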

Equation (2.18) bears resemblance to the Bayes-Gaussian discriminant function of
Equation (2.12). In fact, the minimum Mahalanobis-distance classifier is optimal in the
case of Gaussian distributions with equal covariance matrices and equal a priori
probabilities. However, the discriminant function of (2.18) is strictly nonparametric.
That is to say that no underlying distribution is assumed. Therefore, the mean vector and
covariance matrices are generally estimated from samples. Given $n_i$ training samples from
the $i$th class, the sample mean [5:48] is given by

$$\hat{\boldsymbol{\mu}}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \mathbf{x}_{ij}. \tag{2.19}$$

The sample covariance matrix [5:49] is given by

$$\hat{\Sigma}_i = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_{ij} - \hat{\boldsymbol{\mu}}_i)'. \tag{2.20}$$
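
A minimal sketch of a minimum Mahalanobis-distance classifier built from Equations (2.18) to (2.20) follows. The two synthetic classes and the test point are assumptions made for the example; `fit_class` and `mahalanobis_discriminant` are hypothetical helper names.

```python
import numpy as np

def fit_class(samples):
    """Sample mean (2.19) and sample covariance with the 1/(n_i - 1) factor (2.20)."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    cov = centered.T @ centered / (samples.shape[0] - 1)
    return mean, cov

def mahalanobis_discriminant(x, mean, cov):
    """Equation (2.18): negative squared Mahalanobis distance to the class mean."""
    diff = x - mean
    return -diff @ np.linalg.solve(cov, diff)

# Synthetic training data for two classes (illustrative only).
rng = np.random.default_rng(1)
class_a = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(100, 2))
class_b = rng.normal(loc=[4.0, 4.0], scale=[0.5, 1.0], size=(100, 2))
params = [fit_class(class_a), fit_class(class_b)]

x = np.array([3.8, 4.1])
scores = [mahalanobis_discriminant(x, m, S) for m, S in params]
label = int(np.argmax(scores))  # index of the nearest class mean
```

Training here amounts to storing one mean vector and one covariance matrix per class, which is the storage-light end of the spectrum discussed in this chapter.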

Minimum Euclidean-Distance. The squared Euclidean distance of a feature
vector $\mathbf{x}$ from the $i$th class mean is given by

$$d^2(\mathbf{x}, \boldsymbol{\mu}_i) = (\mathbf{x} - \boldsymbol{\mu}_i)' (\mathbf{x} - \boldsymbol{\mu}_i) \tag{2.21}$$

where $\boldsymbol{\mu}_i$ is the mean vector. Since class membership is assigned based on
the smallest distance given by Equation (2.21), a discriminant function can be written as

$$g_i(\mathbf{x}) = -(\mathbf{x} - \boldsymbol{\mu}_i)' (\mathbf{x} - \boldsymbol{\mu}_i). \tag{2.22}$$

As with the Mahalanobis classifier, it can be shown that Equation (2.22) is optimal in
certain cases, i.e., for Gaussian classes with $\Sigma_i = \sigma^2 I$ and equal a priori
probabilities. However, the Euclidean classifier is generally nonparametric since no
density function is assumed. The mean vector of Equation (2.22) is usually found by
Equation (2.19). Below, it is shown that the minimum Euclidean-distance classifier can be
implemented by a linear discriminant function. First, Equation (2.22) is expanded. This
reveals that $\mathbf{x}'\mathbf{x}$ is a bias term present in each discriminant function.
It is removed, and the new discriminant function is denoted by the tilde notation, which
indicates that classification results are not changed. Dividing by the positive constant 2
then gives Equation (2.23), which yields the same classification results as Equation (2.22).

$$g_i(\mathbf{x}) = -\mathbf{x}'\mathbf{x} + 2 \boldsymbol{\mu}_i' \mathbf{x} - \boldsymbol{\mu}_i' \boldsymbol{\mu}_i$$

$$\tilde{g}_i(\mathbf{x}) = 2 \boldsymbol{\mu}_i' \mathbf{x} - \boldsymbol{\mu}_i' \boldsymbol{\mu}_i$$

$$\tilde{g}_i(\mathbf{x}) = \boldsymbol{\mu}_i' \mathbf{x} - \frac{1}{2} \boldsymbol{\mu}_i' \boldsymbol{\mu}_i \tag{2.23}$$

As  can  be  seen  by  Equation  (2.23),  the  minimum  Euclidean-distance  classifier  is  linear 
in  x. 
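
The equivalence of the distance form (2.22) and the linear form (2.23) can be verified directly. In the sketch below the class means and test points are hypothetical values chosen for illustration.

```python
import numpy as np

def euclidean_discriminant(x, means):
    """Equation (2.22): -(x - mu_i)'(x - mu_i) for every class mean (rows of means)."""
    diff = x - means
    return -np.sum(diff * diff, axis=1)

def linear_euclidean_discriminant(x, means):
    """Equation (2.23): mu_i' x - 0.5 mu_i' mu_i, which is linear in x."""
    return means @ x - 0.5 * np.sum(means * means, axis=1)

# Hypothetical class means and uniformly scattered test points.
means = np.array([[0.0, 0.0], [3.0, 1.0], [-2.0, 4.0]])
rng = np.random.default_rng(2)
X = rng.uniform(-5.0, 5.0, size=(500, 2))

labels_distance = np.array([np.argmax(euclidean_discriminant(x, means)) for x in X])
labels_linear = np.array([np.argmax(linear_euclidean_discriminant(x, means)) for x in X])
```

Both label arrays agree on every test point; the linear form needs only one inner product per class once the bias terms are precomputed.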

The  minimum  Mahalanobis-distance  classifier  is  quadratic.  Therefore,  its  decision 
boundaries  are  hyperquadric  surfaces.  It  has  been  shown  that  the  Euclidean  distance 
classifier  can  be  implemented  as  a  linear  discriminant  function;  therefore,  its  decision 
boundaries  are  hyperplanes.  The  performance  of  the  quadratic  classifier  often  suffers  due 
to  non-normality  of  the  data;  however,  the  linear  classifier  is  robust  to  non-normality 
[20:253].

Parzen-Window Density Estimation and Classification


This is a nonparametric technique that assumes no underlying distribution but
estimates a probability density function. In his classic paper "On Estimation of a
Probability Density Function and Mode," Parzen showed that the density estimate will
approach the actual density as the number of training samples approaches infinity [19].
This holds under certain easily met conditions. The density model and conditions
necessary for convergence are discussed below.

Let $n$ be the number of samples drawn from a particular distribution $p(x)$. The
general form of the probability density estimate $p_n(x)$ in the Parzen-window technique is

$$p_n(x) = \frac{1}{nh} \sum_{j=1}^{n} \varphi\left(\frac{x - x_j}{h}\right) \tag{2.24}$$

where $h$ is a parameter "suitably chosen" [19:1066], $x_j$ is the $j$th training sample,
and $\varphi(y)$ is the window function. Note that the notation above is for a univariate
training set. Parzen states that if $h$ is chosen to satisfy a mild restriction as a
function of $n$, the estimate $p_n(x)$ is asymptotically unbiased, or in mathematical terms
if

$$\lim_{n \to \infty} h(n) = 0 \tag{2.25}$$

and

$$\lim_{n \to \infty} n\, h(n) = \infty \tag{2.26}$$

then

$$\lim_{n \to \infty} E[p_n(x)] = p(x). \tag{2.27}$$

Parzen also shows that the window function $\varphi(y)$ must satisfy the following
requirements:

$$\sup_{-\infty < y < \infty} |\varphi(y)| < \infty \tag{2.28}$$

$$\int_{-\infty}^{\infty} |\varphi(y)|\, dy < \infty \tag{2.29}$$

$$\lim_{y \to \infty} |y\, \varphi(y)| = 0 \tag{2.30}$$

$$\int_{-\infty}^{\infty} \varphi(y)\, dy = 1 \tag{2.31}$$

where $|\cdot|$ is the absolute value.
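
The univariate estimate of Equation (2.24) can be sketched as follows. The standard normal density is used as the window because it satisfies requirements (2.28) to (2.31); the choice $h = 1/\sqrt{n}$ is one assumption that meets conditions (2.25) and (2.26), not a recommendation from the source. The sample set is synthetic.

```python
import numpy as np

def gaussian_window(y):
    """Standard normal density, a window satisfying (2.28)-(2.31)."""
    return np.exp(-0.5 * y * y) / np.sqrt(2.0 * np.pi)

def parzen_estimate(x, samples, h):
    """Equation (2.24): p_n(x) = 1/(n h) * sum_j phi((x - x_j) / h)."""
    x = np.atleast_1d(x)[:, None]
    return gaussian_window((x - samples[None, :]) / h).sum(axis=1) / (samples.size * h)

rng = np.random.default_rng(3)
samples = rng.normal(size=400)    # synthetic univariate training set
h = 1.0 / np.sqrt(samples.size)   # h -> 0 and n*h -> infinity as n grows

grid = np.linspace(-8.0, 8.0, 4001)
density = parzen_estimate(grid, samples, h)
area = np.sum(density) * (grid[1] - grid[0])  # numerical check of unit area
```

The estimate is nonnegative and integrates to approximately one on the grid, since it is an average of shifted and scaled copies of a unit-area window.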

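A popular window shape is Gaussian. The density estimate for a multivariate
Gaussian window is given by

$$p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left[-\frac{1}{2\sigma^2} (\mathbf{x} - \mathbf{x}_i)' (\mathbf{x} - \mathbf{x}_i)\right] \tag{2.32}$$

where $\mathbf{x}_i$ is the $i$th training sample, $d$ is the number of dimensions, and $n$
is the number of training samples in the class under consideration. Note that $\sigma$
replaces $h$ to emphasize the relation to the Gaussian density function. Equation (2.33)
shows the discriminant function form of the Parzen estimate in a Bayes strategy, where
$\mathbf{x}_{ij}$ is the $j$th of the $n_i$ training samples in the $i$th class.

$$g_i(\mathbf{x}) = P(\omega_i) \frac{1}{n_i} \sum_{j=1}^{n_i} \frac{1}{(2\pi\sigma_i^2)^{d/2}} \exp\left[-\frac{1}{2\sigma_i^2} (\mathbf{x} - \mathbf{x}_{ij})' (\mathbf{x} - \mathbf{x}_{ij})\right] \tag{2.33}$$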

The finite sample case of the Parzen-window classifier is not generally optimal. In
these cases, selection of the $h$ parameter, often called the smoothing parameter, greatly
affects the classifier's performance. Selection of the smoothing parameter is discussed
throughout the literature [5], [9], [19], [10], [20]. Due to the intractable nature of an
analytical solution, experimental approaches are generally used to find the appropriate
smoothing parameter. Reference [20:254] suggests a technique in which several
smoothing parameters are tested simultaneously to find the best choice.
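
As an illustration of an experimental search over the smoothing parameter, the sketch below evaluates a Gaussian-window Parzen classifier for several candidate widths on held-out samples. It is a simple stand-in under stated assumptions, not the specific procedure of reference [20]; the data are synthetic, one width is shared by both classes, and the candidate values are arbitrary.

```python
import numpy as np

def parzen_discriminant(x, samples, sigma, prior):
    """Gaussian-window Parzen discriminant for one class with window width sigma."""
    d = samples.shape[1]
    sq_dist = np.sum((samples - x) ** 2, axis=1)
    windows = np.exp(-sq_dist / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2) ** (d / 2.0)
    return prior * windows.mean()

def classify(x, classes, sigma, priors):
    """Assign x to the class with the largest discriminant value."""
    return int(np.argmax([parzen_discriminant(x, s, sigma, P) for s, P in zip(classes, priors)]))

# Synthetic training and held-out test sets for two classes.
rng = np.random.default_rng(4)
train = [rng.normal([0.0, 0.0], 1.0, size=(60, 2)), rng.normal([3.0, 3.0], 1.0, size=(60, 2))]
test = [rng.normal([0.0, 0.0], 1.0, size=(40, 2)), rng.normal([3.0, 3.0], 1.0, size=(40, 2))]
priors = [0.5, 0.5]

error_rate = {}
for sigma in (0.1, 0.5, 1.0, 2.0):  # arbitrary candidate smoothing parameters
    wrong = sum(classify(x, train, sigma, priors) != label
                for label, samples in enumerate(test) for x in samples)
    error_rate[sigma] = wrong / 80.0
best_sigma = min(error_rate, key=error_rate.get)
```

Note that every training sample is stored and visited for each test point, which is the storage and computation cost that motivates the reduction techniques discussed in this thesis.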


The k-Nearest-Neighbor Rule

The k-Nearest-Neighbor (kNN) technique is nonparametric, assuming nothing
about the distribution of the data. Stated succinctly, this rule assigns an unlabeled
pattern to the same class as its $k$ nearest training patterns. In the case that not all
the neighbors are from the same class, a voting scheme is used. Duda and Hart [5:104]
state that this rule can be viewed as an estimate of "the a posteriori probabilities
$P(\omega_i \mid \mathbf{x})$ from samples." Raudys and Jain [20:255] advance this
interpretation by pointing out that the kNN technique can be viewed as the "Parzen window
classifier with a hyper-rectangular window function." As with the Parzen-window technique,
the kNN classifier is more accurate as the number of training samples increases [5:105].

A special case of the kNN technique is when $k = 1$. This case, known as the NN
classifier, was studied in detail by Cover and Hart [4], who showed that its error rate is
bounded above by twice the Bayes error rate in the "large sample case." The NN rule can be
stated in discriminant function form as

$$g_i(\mathbf{x}) = \max_j \left[-(\mathbf{x} - \mathbf{x}_{ij})' (\mathbf{x} - \mathbf{x}_{ij})\right] \tag{2.34}$$

where $\mathbf{x}_{ij}$ is the $j$th training sample of the class labeled $\omega_i$, and
$\mathbf{x}$ denotes the unknown test pattern. A simplified version of this discriminant
function is given by

$$\tilde{g}_i(\mathbf{x}) = \max_j \left(\mathbf{x}_{ij}' \mathbf{x} - \frac{1}{2} \mathbf{x}_{ij}' \mathbf{x}_{ij}\right). \tag{2.35}$$
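
The NN discriminant functions (2.34) and (2.35), together with a simple kNN vote, can be sketched as below. The training data are synthetic, and the tie-breaking rule in `knn_classify` (smallest class label wins) is an arbitrary assumption of this example.

```python
import numpy as np

def nn_discriminant(x, class_samples):
    """Equation (2.34): max over j of -(x - x_ij)'(x - x_ij)."""
    diff = class_samples - x
    return np.max(-np.sum(diff * diff, axis=1))

def nn_discriminant_linear(x, class_samples):
    """Equation (2.35): max over j of x_ij' x - 0.5 x_ij' x_ij."""
    return np.max(class_samples @ x - 0.5 * np.sum(class_samples**2, axis=1))

def knn_classify(x, samples, labels, k):
    """Majority vote among the k nearest training samples (ties: smallest label)."""
    sq_dist = np.sum((samples - x) ** 2, axis=1)
    nearest = labels[np.argsort(sq_dist)[:k]]
    return int(np.argmax(np.bincount(nearest)))

# Synthetic two-class training set.
rng = np.random.default_rng(5)
classes = [rng.normal([0.0, 0.0], 1.0, size=(30, 2)), rng.normal([4.0, 0.0], 1.0, size=(30, 2))]
samples = np.vstack(classes)
labels = np.repeat([0, 1], 30)
x = np.array([3.5, 0.2])

label_nn = int(np.argmax([nn_discriminant(x, c) for c in classes]))
label_nn_linear = int(np.argmax([nn_discriminant_linear(x, c) for c in classes]))
label_1nn = knn_classify(x, samples, labels, k=1)
label_5nn = knn_classify(x, samples, labels, k=5)
```

The two NN discriminant forms and the $k = 1$ vote give the same label, since (2.35) differs from (2.34) only by a term common to all classes and a positive scale factor.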
CHAPTER 3

WEIGHTED  PARZEN  WINDOWS 


This chapter introduces a novel nonparametric pattern recognition approach,
named Weighted Parzen Windows (WPW). First, the training and classification
algorithms are presented. Then, it is shown that the training algorithm is stepwise
optimal. Also, the computational complexity of the training and classification algorithms
is discussed. Finally, design considerations are discussed.

Weighted-Parzen-Window Training

Training is a parallel operation in that training for the class labeled $\omega_i$ is
independent of the training for the class labeled $\omega_j$, for $i \neq j$. Therefore,
training for all classes can be conducted in parallel, and the following discussion focuses
on a single class to simplify notation. The training phase is presented in algorithmic
form, followed by a discussion of the major concepts.

Training Algorithm. Given a training set of $n$ $d$-dimensional feature samples,
$X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ where $\mathbf{x} = [x_1, \ldots, x_d]'$, the
basic approach of the WPW training algorithm is to find a set of $\tilde{n}$ reference
vectors¹, $R = \{\mathbf{r}_1, \ldots, \mathbf{r}_{\tilde{n}}\}$, where
$1 \le \tilde{n} \le n$. Since the number of samples in $R$ can be less than the number of
samples in the original training set, some information may be lost. To compensate for lost
information, a set of $\tilde{n}$ weights, $w = \{w_1, \ldots, w_{\tilde{n}}\}$, is found.
Each scalar weight $w_j$ corresponds to the reference vector $\mathbf{r}_j$,
$j = 1, \ldots, \tilde{n}$. The role of the weights is discussed in the sequel. The
training algorithm for a single class is presented in Table 3.1. In Table 3.1, the
estimate, $\hat{p}(\mathbf{x})$, is given by Equation (3.1)².

$$\hat{p}(\mathbf{x}) = \frac{1}{n h^d} \sum_{j=1}^{\tilde{n}} w_j\, \varphi\left(\frac{\mathbf{x} - \mathbf{r}_j}{h}\right) \tag{3.1}$$

1 If readers are familiar with classical Vector Quantization (VQ), they will recognize that a collection of
reference vectors is the same as a codebook. Current research has focused on pattern recognition,
although the training algorithm is directly applicable to VQ applications. Excellent treatment of classical
VQ can be found in reference [1].

Training Concepts. The WPW algorithm can be considered a second-order
approximation since it relies on the Parzen estimate and, therefore, can only be as
accurate as the Parzen estimate. As can be seen by Equation (3.1), the WPW estimate is a
superposition of weighted Parzen windows. This estimate is a quantized version of the
Parzen estimate. Quantization occurs when two window functions that are close in vector
space are combined to create a new single weighted-window function. The new window
function is weighted by the total number of combinations that its center has undergone. In
other words, $w_i$ window functions are centered at $\mathbf{r}_i$, which is the average
vector of $w_i$ similar training samples. This procedure allows the training algorithm to
learn the densest regions of the training set, which, in turn, allows reference vector
reduction. This reduction is offset by weight adjustments, which allow the algorithm to
remember where the densest regions occur. The training algorithm allows for quantization
of the vector space with respect to the probability space. In this technique, storage
requirements are traded off against probability space error.


2 Neural network literature often refers to this type of equation as a radial basis function (RBF) [3], [16],
[17]. Current research has focused on statistical pattern recognition, although WPW training is directly
applicable to RBF neural network design.

Table 3.1: WPW training algorithm for a single class of feature data

Step 1. Calculate and store $p_n(\mathbf{x}_k)$, where $k = 1, \ldots, n$.
Step 2. Choose: $h > 0$, $e_{max} > 0$.
Step 3. Initialize: $\tilde{n} \leftarrow n$, $R \leftarrow X$, $w_i = 1$ where $i = 1, \ldots, n$.
Step 4. Choose the two closest³ reference vectors $\mathbf{r}_i$ and $\mathbf{r}_j$.
Step 5. Calculate the vector $\mathbf{r}_0 = \dfrac{\mathbf{r}_i w_i + \mathbf{r}_j w_j}{w_i + w_j}$.
Step 6. (a) update $R$ such that $\{\mathbf{r}_i, \mathbf{r}_j\} \notin R$ and $\mathbf{r}_0 \in R$,
        (b) update the coefficient for $\mathbf{r}_0$, $w_0 \leftarrow w_i + w_j$,
        (c) update $\tilde{n} \leftarrow \tilde{n} - 1$.
Step 7. Calculate $e = \dfrac{1}{n} \sum_{k=1}^{n} \dfrac{|p_n(\mathbf{x}_k) - \hat{p}(\mathbf{x}_k)|}{p_n(\mathbf{x}_k)} \times 100\%$.
Step 8. IF ($e < e_{max}$) THEN
            if $\tilde{n} = 1$, then stop training, output $R$ and $w$;
            otherwise, go to Step 4;
        ELSEIF ($e > e_{max}$) THEN
            reconstruct $R$ such that $\{\mathbf{r}_i, \mathbf{r}_j\} \in R$ and $\mathbf{r}_0 \notin R$,
            replace coefficients for $\mathbf{r}_i$, $\mathbf{r}_j$ as $w_i$, $w_j$,
            adjust $\tilde{n} \leftarrow \tilde{n} + 1$,
            stop training, output $R$ and $w$;
        END IF

3 The meaning of closest is not discussed in detail in this section. In general, however, the closeness of
two reference vectors should be measured with the metric that the window function uses. A detailed
discussion can be found in the section which addresses stepwise optimization.

Combination of reference vectors is controlled by the training algorithm. An error
function is used to measure the deviation of the new estimate $\hat{p}(\mathbf{x})$ from
the Parzen estimate $p_n(\mathbf{x})$ at each of the training samples. The error function
in step 7 of the training algorithm (Equation (3.2)) is the average percent error between
the two estimates over the training samples.

$$e = \frac{1}{n} \sum_{k=1}^{n} \frac{|p_n(\mathbf{x}_k) - \hat{p}(\mathbf{x}_k)|}{p_n(\mathbf{x}_k)} \times 100\%. \tag{3.2}$$


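As long as $e$ is below $e_{max}$, reference vectors will be combined according to step 5
(Equation (3.3)) of the training algorithm, and $\tilde{n}$ will continue to decrease.

$$\mathbf{r}_0 = \frac{\mathbf{r}_i w_i + \mathbf{r}_j w_j}{w_i + w_j} = \frac{w_i}{w_i + w_j}\, \mathbf{r}_i + \frac{w_j}{w_i + w_j}\, \mathbf{r}_j \tag{3.3}$$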


When reference vectors are combined, weights are combined according to

$$w_0 = w_i + w_j, \qquad i \neq j. \tag{3.4}$$

(Note: the training algorithm requires that $\sum_{j=1}^{\tilde{n}} w_j = n$.) Clearly, the
weights are a method of counting the number of original reference vectors (training
samples) combined into a single reference vector, and they are integer values.
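
The training procedure of Table 3.1 can be sketched in code. This is a minimal illustration under several assumptions that the source does not fix: the window is Gaussian with width $h$, closeness is measured by Euclidean distance, the estimate is normalized so that it integrates to one in $d$ dimensions, and the case $e = e_{max}$ (not covered by Step 8) is treated as acceptable. The names `window_sum` and `wpw_train` are hypothetical, and the data are synthetic.

```python
import numpy as np

def window_sum(points, centers, weights, h):
    """Weighted Gaussian-window estimate at each row of points.

    Assumed form: (1/n) * sum_j w_j * N(point; r_j, h^2 I), with n = sum of weights.
    With all weights equal to 1 and centers = training set, this is the Parzen estimate.
    """
    d = centers.shape[1]
    sq = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    phi = np.exp(-sq / (2.0 * h * h)) / (2.0 * np.pi * h * h) ** (d / 2.0)
    return phi @ weights / weights.sum()

def wpw_train(X, h, e_max):
    """Sketch of Table 3.1 for a single class; returns reference vectors and weights."""
    n = X.shape[0]
    R, w = X.astype(float).copy(), np.ones(n)             # Step 3
    parzen = window_sum(X, R, w, h)                        # Step 1
    while R.shape[0] > 1:
        # Step 4: closest pair of reference vectors (Euclidean distance assumed).
        sq = np.sum((R[:, None, :] - R[None, :, :]) ** 2, axis=2)
        np.fill_diagonal(sq, np.inf)
        i, j = np.unravel_index(np.argmin(sq), sq.shape)
        # Steps 5-6: replace the pair by its weighted average; the weights add.
        r0 = (R[i] * w[i] + R[j] * w[j]) / (w[i] + w[j])
        keep = np.ones(R.shape[0], dtype=bool)
        keep[[i, j]] = False
        R_new = np.vstack([R[keep], r0])
        w_new = np.append(w[keep], w[i] + w[j])
        # Step 7: average percent error relative to the Parzen estimate.
        e = 100.0 * np.mean(np.abs(parzen - window_sum(X, R_new, w_new, h)) / parzen)
        # Step 8: once the bound is exceeded, discard the last merge and stop.
        if e > e_max:
            break
        R, w = R_new, w_new
    return R, w

# Synthetic single-class training set with two dense regions.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal([0.0, 0.0], 0.3, size=(40, 2)),
               rng.normal([3.0, 3.0], 0.3, size=(40, 2))])
R, w = wpw_train(X, h=0.5, e_max=5.0)
```

In this sketch `e_max` is the single parameter that trades storage for fidelity: a larger value permits more merges and therefore fewer stored reference vectors. Each merge preserves both the sum of the weights and the weighted mean of the reference vectors.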


Classification  Algorithm 


Once the reference vectors and weights are found for each category, a discriminant
analysis approach can be used. For the reference vectors $R_i$ and the weights $w_i$, the
discriminant function for the $i$th category is given by

$$g_i(\mathbf{x}) = P(\omega_i)\, \hat{p}(\mathbf{x} \mid \omega_i) = \frac{P(\omega_i)}{n_i h_i^d} \sum_{j=1}^{\tilde{n}_i} w_{ij}\, \varphi\left(\frac{\mathbf{x} - \mathbf{r}_{ij}}{h_i}\right) \tag{3.5}$$

where $P(\omega_i)$ is the a priori probability of the class labeled $\omega_i$, $n_i$ is
the number of training samples in the $i$th class, $\tilde{n}_i$ is the number of reference
vectors in the $i$th class, $\mathbf{r}_{ij}$ is the $j$th reference vector of the $i$th
class, $w_{ij}$ is the $j$th coefficient corresponding to $\mathbf{r}_{ij}$,
$\varphi(\cdot)$ is the window function, and $h_i$ is the parameter that controls the
window width for the $i$th category. Equation (3.5) is simply Equation (3.1) weighted by
the a priori probability of a given class. This ensures that a Bayes-optimal solution can
be approached (see Equation (2.7)). Equation (3.5) is written in a compact form below to
show its similarity to the optimal Bayes strategy of Equation (2.8).

$$g_i(\mathbf{x}) = \hat{p}(\mathbf{x} \mid \omega_i)\, P(\omega_i) \tag{3.6}$$

Pattern  classification  of  multidimensional  feature  data  can  be  achieved  by  first  training 
with  the  algorithm  in  Table  3.1,  then  using  Equation  (3.6)  for  testing  by  discriminant 
function  analysis. 
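
A sketch of classification with trained reference vectors and weights follows. It assumes a Gaussian window and the $d$-dimensional normalization used for the window estimate; the reference vectors, weights, window widths, and priors in `models` are hypothetical values standing in for the output of training, and the function names are introduced only for this example.

```python
import numpy as np

def wpw_discriminant(x, R, w, h, prior):
    """Prior times the weighted Gaussian-window estimate for one class (assumed form)."""
    d = R.shape[1]
    sq = np.sum((R - x) ** 2, axis=1)
    phi = np.exp(-sq / (2.0 * h * h)) / (2.0 * np.pi * h * h) ** (d / 2.0)
    return prior * (w @ phi) / w.sum()  # w.sum() equals the class training-set size

# Hypothetical trained models: (reference vectors, weights, window width, prior).
models = [
    (np.array([[0.0, 0.0], [1.0, 0.5]]), np.array([30.0, 10.0]), 0.8, 0.5),
    (np.array([[4.0, 4.0]]), np.array([40.0]), 0.8, 0.5),
]

def classify(x):
    """Label x with the class whose discriminant function is largest."""
    return int(np.argmax([wpw_discriminant(x, *m) for m in models]))

label_a = classify(np.array([0.2, 0.1]))
label_b = classify(np.array([3.9, 4.2]))
```

Only the reference vectors and their weights are visited at test time, so the cost per class scales with the number of reference vectors rather than with the number of training samples.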

Stepwise  Optimization 

The WPW training algorithm quantizes vector space based on a probability
space error criterion. Quantization causes the WPW estimate to deviate from the Parzen
estimate, thereby introducing error between the two. Once this error, as measured by
Equation (3.2), exceeds a predetermined value, $e_{max}$, training is halted. Since the
objective of the training algorithm is to reduce the number of reference vectors without
introducing error into the density estimate, it makes sense to minimize error for each
training step. One way to minimize stepwise error is to minimize quantization error.
Quantization error is introduced in each step of the training algorithm as a result of
combining two reference vectors. In what follows, it will be proven that the WPW
training algorithm minimizes quantization error for each step.

Consider combining two reference vectors r_i and r_j whose weights are w_i and w_j,
respectively. The resulting reference vector r_0 is given by Equation (3.3), and its weight is
given by w_0 = w_i + w_j (see Equation (3.4)). On the kth training step, r_i and r_j are the
centers of two weighted window functions w_i φ(·) and w_j φ(·), which contribute to the
density given by Equation (3.1). On step k + 1, after combination, their contribution
is a single weighted window function, w_0 φ(·), centered at r_0. The quantization error
introduced by this combination is defined in terms of the three above-mentioned weighted
window functions. The volume enclosed by each of these weighted windows is denoted
by V_i, V_j, and V_0. Given a vector space ℜ, the region within V_0 is denoted as ℜ_0. The
regions of intersection (V_i ∩ V_0) and (V_j ∩ V_0) are denoted as ℜ_i and ℜ_j, respectively.

The  quantization  error  integral  is  defined  as 


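but proper selection of the Parzen-window function, φ(·), requires that its volume equal
1, so the quantization error is

$$
\begin{aligned}
e_q &= \frac{(w_i + w_j)\int_{\Re_0}\varphi(\cdot)\,dx - w_i\int_{\Re_i}\varphi(\cdot)\,dx - w_j\int_{\Re_j}\varphi(\cdot)\,dx}{w_i + w_j} \\
    &= \frac{w_i + w_j - w_i\int_{\Re_i}\varphi(\cdot)\,dx - w_j\int_{\Re_j}\varphi(\cdot)\,dx}{w_i + w_j} \\
    &= 1 - \frac{w_i}{w_i + w_j}\int_{\Re_i}\varphi(\cdot)\,dx - \frac{w_j}{w_i + w_j}\int_{\Re_j}\varphi(\cdot)\,dx
\end{aligned}
\tag{3.7}
$$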
Maximum quantization error occurs when the integrals of Equation (3.7) are both equal to
0, which results in a quantization error of 1. This is the worst-case scenario, when (V_i ∩ V_0)
and (V_j ∩ V_0) are equal to zero. Clearly, the minimum quantization error occurs when the
integral terms are equal to 1, which results in a quantization error of 0. Equation (3.7) is
exactly 0 if and only if ℜ_i and ℜ_j are completely enclosed within ℜ_0. But ℜ_i and ℜ_j
can be completely enclosed within ℜ_0 if and only if w_i φ(·) and w_j φ(·) are
completely enclosed within w_0 φ(·). Geometrically, this is only true if each window
function shares the same center, i.e., when

$$ r_i = r_j = r_0. \tag{3.8} $$

The best procedure to follow when deciding which two vectors to combine is to
select the vectors whose associated regions ℜ_i and ℜ_j are the largest. To maximize ℜ_i
and ℜ_j, Step 4 of the training algorithm selects the two closest reference vectors as the
two to be combined. The distance measure used should be the same as the measure used
by the Parzen-window function. The closest two reference vectors are defined as the two
vectors r_i and r_j that are closest to their corresponding r_0. Since Equation (3.3) is a
convex combination of two vectors, the two reference vectors which are closest to their
corresponding r_0 are simply the two closest vectors r_i and r_j, where i ≠ j.
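
One combination step of this kind can be sketched as follows. The sketch assumes Euclidean distance as the window's distance measure and takes Equation (3.3) to be the weighted mean of the two vectors, which is the form that appears later as Equation (4.2). The function names are illustrative, and the brute-force pair search is only the simplest possible choice.

```python
import math
from itertools import combinations

def closest_pair(refs):
    """Indices (i, j), i < j, of the two closest reference vectors (Euclidean distance)."""
    return min(combinations(range(len(refs)), 2),
               key=lambda p: math.dist(refs[p[0]], refs[p[1]]))

def merge_step(refs, weights):
    """Combine the closest pair into one weighted-mean vector whose weight is the sum."""
    i, j = closest_pair(refs)
    w0 = weights[i] + weights[j]                      # w_0 = w_i + w_j
    r0 = tuple((weights[i] * a + weights[j] * b) / w0  # weighted mean of r_i and r_j
               for a, b in zip(refs[i], refs[j]))
    keep = [k for k in range(len(refs)) if k not in (i, j)]
    return [refs[k] for k in keep] + [r0], [weights[k] for k in keep] + [w0]

# Made-up example: the first two vectors are closest and get combined.
refs, weights = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)], [1, 3, 1]
new_refs, new_weights = merge_step(refs, weights)
print(new_refs, new_weights)
```

The combined vector sits closer to the heavier of the two originals, and the total weight is unchanged by the step.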

Returning to the proof of stepwise optimality, it can be stated that when deciding
which two vectors to combine, one should select the two closest as measured by
Equation (3.9). By selecting the two closest vectors, the quantization error for a given
step is minimized. Or, mathematically,

$$ \lim_{d_w(r_i,\,r_j) \to 0} e_q = 0 \tag{3.9} $$

where d_w is the distance measure used by the window function. Since the WPW training
algorithm uses this procedure, it is stepwise optimal.

Computational Complexity


In this section, the computational complexity of the WPW technique is discussed.
As can be seen from the training and classification algorithms, distance calculations require
the bulk of processing resources. Therefore, the following analysis determines the
order of magnitude of the number of distance calculations for a single class of data. The
analysis is based on serial computation.

Training Complexity. Distance calculations are required in three of the WPW
training steps. As shown in Table 3.1, distance calculations are required in
Steps 1, 3, and 7. In Step 1, n probability calculations are required, each with n distance
calculations. Therefore, the number of distance calculations, v_1, in Step 1 is given by

$$ v_1 = n^2. \tag{3.10} $$

The number of distance calculations necessary in Step 3 of the training phase is
related to the number of reference vectors. On the jth training step, there are k reference
vectors, bounded by ñ - 1 ≤ k ≤ n. The lower limit of k is given by ñ - 1 because the training
algorithm always combines two reference vectors before the error function is calculated.
In the case of ñ = 1, the training algorithm terminates, so no distance calculations are
performed (refer to Step 8 of the training algorithm). Step 3 of the training algorithm
requires the calculation of k(k - 1) distance measures for the jth step. The total number of
calculations for this step over the entire training phase is given by

$$ v_2 = \sum_{k=\tilde{n}-1}^{n} k(k-1). \tag{3.11} $$

Equation (3.11) can be used to calculate the worst-case serial requirements. In the worst
case, when ñ = 1, v_2max is given by

$$
\begin{aligned}
v_{2max} &= \sum_{k=0}^{n} k(k-1) \\
         &= \sum_{k=1}^{n} k^2 - \sum_{k=1}^{n} k \\
         &= \tfrac{1}{6}\,[\,n(n+1)(2n+1)\,] - \tfrac{1}{2}\,n(n+1) \\
         &= \tfrac{1}{3}\,(n^3 - n).
\end{aligned}
\tag{3.12}
$$

The distance calculations required in Step 7 of the training algorithm are
dependent on the value of ñ. During this step, there are n probability calculations. Each
probability calculation requires k distance calculations, where ñ - 1 ≤ k ≤ n. As explained
for Equation (3.11), the lower limit of k is given by ñ - 1. The total distance computations
necessary for Step 7 are given by

$$ v_3 = n \sum_{k=\tilde{n}-1}^{n} k. \tag{3.13} $$

In the worst case, v_3max is given by

$$
\begin{aligned}
v_{3max} &= n \sum_{k=0}^{n} k \\
         &= n \sum_{k=1}^{n} k \\
         &= \tfrac{1}{2}\,n^2 (n+1) \\
         &= \tfrac{1}{2}\,(n^3 + n^2).
\end{aligned}
\tag{3.14}
$$

Combining Equations (3.10), (3.12), and (3.14), the total number of distance calculations
during training, for the worst case, is given by

$$ v_{train} = O(n^3). \tag{3.15} $$

Classification Complexity. The number of distance calculations required by a
single discriminant function is ñ for each test made. The largest ñ can be is n. The upper
bound on the number of distance calculations necessary for a single discriminant function
is given by

$$ v_{test} = O(n). \tag{3.16} $$

The WPW training algorithm requires significantly more distance calculations
than the classification algorithm. The number of
distance calculations for a single class of data is O(n³). The number of calculations for all
training classes is still O(n³) because the number of classes is generally much smaller
than n. Although calculations of this order can be severe, it must be noted that the above
analysis is for worst-case serial computation. Possible refinements can be made when
finding the two closest vectors in Step 4 of the training algorithm, e.g., preprocessing the
reference vectors and recursively updating them on every step. Several quick search
routines are available to programmers [6], [9], [23], [26]. It may be possible to modify
such a routine for the WPW training algorithm. Also, it may be possible to store the
WPW density estimate in a table, updating it recursively on each training step, to reduce the
calculations necessary in Step 7 of the training algorithm. Finally, the power of parallel
computation can be invoked, making WPW training nearly trivial since all distance
calculations can be performed simultaneously. Regardless of the possible shortcuts, the
analysis of the training algorithm shows that it will terminate after a finite number of
training steps and is at worst O(n³). The classification algorithm, on the other hand, is at
worst O(n).
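
The worst-case counts can be checked numerically. The short sketch below evaluates the sums of Equations (3.12) and (3.14) by brute force and compares them with their closed forms, (n³ - n)/3 and (n³ + n²)/2. It only counts operations and does not run the training algorithm; the function names are illustrative.

```python
def v1(n):
    """Step 1: n probability calculations with n distances each, Equation (3.10)."""
    return n * n

def v2_max(n):
    """Step 3, worst case: the sum of k(k - 1) for k = 0..n, Equation (3.12)."""
    return sum(k * (k - 1) for k in range(0, n + 1))

def v3_max(n):
    """Step 7, worst case: n times the sum of k for k = 0..n, Equation (3.14)."""
    return n * sum(range(0, n + 1))

def total_worst_case(n):
    """Total worst-case distance calculations during training for one class."""
    return v1(n) + v2_max(n) + v3_max(n)

for n in (1, 10, 100):
    assert 3 * v2_max(n) == n**3 - n        # closed form of Equation (3.12)
    assert 2 * v3_max(n) == n**3 + n**2     # closed form of Equation (3.14)
    print(n, total_worst_case(n), total_worst_case(n) / n**3)
```

The printed ratio settles toward a constant as n grows, which is what the O(n³) bound of Equation (3.15) expresses.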
Designing the Weighted-Parzen-Window Classifier


Selecting the Window Shape. When selecting a window shape, those suitable for
the Parzen density estimate should be chosen. Any window shape satisfying the
conditions established by Parzen [19] is sufficient, and the same shape should be used for both
the Parzen estimate, p_n(x), and the WPW estimate, p(x). Several window shapes can be found in
references [5] and [19]. The most widely referenced window function is the Gaussian.

Selecting the Smoothing Parameter. The smoothing parameter should be the
same as that used for the Parzen estimate, p_n(x). Choosing the smoothing parameter is a
critical step in the classifier design. Selection of the smoothing parameter is discussed
throughout the literature [5], [9], [10], [19], [20]. Due to the intractable nature of an
analytical solution, experimental approaches are generally used to find the appropriate
smoothing parameter. Reference [20:254] suggests a technique in which several
smoothing parameters are tested simultaneously to find the best choice. Although the
training and classification algorithms allow for selection of different smoothing parameters
for each class of data, it is recommended to use a single value for all classes [20:254].

Selecting the Maximum Allowable Error. The value emax is used to control the
training algorithm's aggressiveness. That is to say, if emax is small, then the number of
vectors in R will be nearly n; otherwise, if emax is large, then ñ ≪ n. Equation (3.20)
shows how ñ is affected by emax in the limit.

$$ \lim_{e_{max} \to 0} \tilde{n} = n \tag{3.20.a} $$

$$ \lim_{e_{max} \to \infty} \tilde{n} = 1 \tag{3.20.b} $$
The relationships of Equation (3.20) are helpful in understanding the effects of emax. The
value emax can be selected by specification or engineering judgment. In either case,
Equation (3.2) allows for intuitive choices of emax. For example, if emax is chosen as
15%, then the training algorithm will stop when the average variation between the two
estimates exceeds that value. The value of emax is in general the same for each category;
however, it can be selected individually for each class. When designing the classifier, the
recommended procedure is shown in Table 3.2.


Table 3.2: Weighted-Parzen-window classifier design steps.

1. Select emax = 0, and determine the smoothing parameters that minimize
classification error (see footnote 4).

a.  If  the  training  set  is  small,  use  the  leave-one-out  method  to  estimate  the 
error  rate  [5:76]. 

b.  If  the  training  set  is  large,  partition  the  data  into  two  disjoint  sets  for 
training  and  testing  to  estimate  the  error  rate  [5:76]. 

2.  Choose  the  value  of  emax  based  on  design  specifications  or  reduction. 

3.  Train  and  evaluate  the  performance.  If  performance  is  satisfactory,  implement  the 
device;  otherwise  go  to  step  2. 


4 By choosing emax = 0, the WPW classifier is equivalent to the Parzen-window classifier (this is proven
in Chapter 4). Therefore, the techniques for choosing the smoothing parameter as outlined in references
[5], [9], [19], and [20] are valid and should be used. Although the training and classification algorithms
allow for selection of different smoothing parameters for each class of data, it is recommended to use a
single value for all classes [20:254].
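
Step 1.a of Table 3.2 can be sketched as a simple grid search. With emax = 0 every training sample is a reference vector of weight 1, so the sketch computes the leave-one-out error of a Gaussian Parzen-window classifier for each candidate smoothing parameter. Equal priors, one shared σ for all classes, at least two samples per class, and the candidate grid itself are assumptions made for illustration; the names are hypothetical.

```python
import math

def log_parzen(x, samples, sigma):
    """Log of the Gaussian Parzen estimate at x, using log-sum-exp for stability."""
    d = len(x)
    exps = [-sum((a - b) ** 2 for a, b in zip(x, s)) / (2.0 * sigma ** 2)
            for s in samples]
    m = max(exps)
    log_sum = m + math.log(sum(math.exp(e - m) for e in exps))
    return log_sum - math.log(len(samples)) - 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)

def loo_error(data, sigma):
    """Leave-one-out error rate; data maps a class label to its list of samples."""
    errors = total = 0
    for label, samples in data.items():
        for k, x in enumerate(samples):
            scores = {}
            for other, pts in data.items():
                # Hold the test sample out of its own class only.
                pool = pts[:k] + pts[k + 1:] if other == label else pts
                scores[other] = log_parzen(x, pool, sigma)
            errors += max(scores, key=scores.get) != label
            total += 1
    return errors / total

def select_sigma(data, candidates):
    """Candidate with the lowest leave-one-out error (first one wins ties)."""
    return min(candidates, key=lambda s: loo_error(data, s))

# Made-up, well-separated toy data.
data = {"A": [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)],
        "B": [(5.0, 5.0), (5.5, 5.0), (5.0, 5.5)]}
print(loo_error(data, 1.0), select_sigma(data, [0.5, 1.0, 2.0]))
```

For a large training set, Step 1.b would replace `loo_error` with the error measured on a disjoint test partition.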
CHAPTER 4

ANALYTICAL  RESULTS 


The  Gaussian  distribution  function  is  prominent  throughout  pattern  recognition 
literature  because  of  its  analytical  tractability  [5:22].  The  well-known  properties  of  the 
Gaussian  distribution  are  extremely  helpful  when  analyzing  the  WPW  classifier.  In  this 
chapter,  it  is  shown  that  the  performance  of  several  well-known  classifiers  can  be  learned 
by  varying  the  two  system  parameters.  In  particular,  the  Bayes-Gaussian,  minimum 
Euclidean-distance,  Parzen-window,  and  nearest-neighbor  classifiers  are  derived. 

Using  Gaussian  Windows 


To use Gaussian windows, Equation (3.5) is written as

$$ g_i(x) = P(\omega_i)\,\frac{1}{n_i} \sum_{j=1}^{\tilde{n}_i} \frac{w_{ij}}{(2\pi\sigma_i^2)^{d/2}} \exp\!\left[-\frac{1}{2\sigma_i^2}(x - r_{ij})^{t}(x - r_{ij})\right] \tag{4.1} $$

where r_ij is the jth reference vector of the ith class, w_ij is the corresponding weight, d is the
number of dimensions, and ñ_i is the number of reference vectors for the ith class. Note that it
is customary to replace h_i with σ_i to emphasize the relation of Equation (4.1) to the
Gaussian distribution.


Special  Case  Training  Results 

This section will show that proper selection of the system parameters σ and emax
will result in several well-known classifiers. In what follows, the training algorithm is
shown to be capable of learning Bayes-Gaussian, minimum Euclidean-distance,
Parzen-window, and nearest-neighbor performance.

Case 1: Bayes-Gaussian Classifier. Given a set of training data for c classes
labeled ω_i, i = 1, ..., c, choose emax = ∞ and σ_i as some suitable number. In this case,
σ_i can be different for each class. Consider the effect of emax. Since all error of the
training phase is tolerated, a single reference vector will be used to represent each
category upon completion of training. Because Equation (3.3) is used in Step 5 of the
training algorithm, the reference vector of each category is exactly the sample mean vector
of that category. Consider Equation (4.2) on the final step of training for the ith
category:

$$ r_{i0} = \frac{r_{i1} w_{i1} + r_{i2} w_{i2}}{w_{i1} + w_{i2}} \tag{4.2} $$


Since this is the combination of the two final reference vectors, w_i1 + w_i2 = n_i, where n_i is
the number of reference vectors at the beginning of training, or equivalently,
w_i2 = n_i - w_i1. Equation (4.2) can be rewritten as

$$ r_{i0} = \frac{r_{i1} w_{i1} + r_{i2}\,(n_i - w_{i1})}{w_{i1} + (n_i - w_{i1})} = \hat{\mu}_i \tag{4.3} $$


Upon the completion of training, the final reference vector is the sample mean of the
training data, μ̂_i, its weight is w_i1 = n_i, and ñ_i = 1. With this in mind, Equation (4.1) can
be rewritten as

$$ g_i(x) = P(\omega_i)\,\frac{1}{n_i}\,\frac{n_i}{(2\pi\sigma_i^2)^{d/2}} \exp\!\left[-\frac{1}{2\sigma_i^2}(x - \hat{\mu}_i)^{t}(x - \hat{\mu}_i)\right] \tag{4.4} $$

which simplifies to

$$ g_i(x) = \frac{P(\omega_i)}{(2\pi\sigma_i^2)^{d/2}} \exp\!\left[-\frac{1}{2\sigma_i^2}(x - \hat{\mu}_i)^{t}(x - \hat{\mu}_i)\right] \tag{4.5} $$

By using σ_i² I, where I is the identity matrix, as the covariance matrix of a Gaussian
distribution function, Equation (4.5) can be rearranged as shown in Equation (4.6).

$$ g_i(x) = \frac{1}{(2\pi)^{d/2}\,|\sigma_i^2 I|^{1/2}} \exp\!\left[-\tfrac{1}{2}(x - \hat{\mu}_i)^{t}(\sigma_i^2 I)^{-1}(x - \hat{\mu}_i)\right] P(\omega_i) \tag{4.6} $$

Equation (4.6) is the familiar Bayes-Gaussian classifier (see Equation (2.11)) for a
class-conditionally independent normal multivariate distribution x|ω_i ~ N[μ̂_i, σ_i² I]. Decision
surfaces that result from discriminant analysis are hyperquadric. In the cases that μ̂_i and
σ_i accurately reflect the training data, the WPW classifier is optimal.
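
The claim that repeated weighted-mean combination ends at the sample mean can be checked with a small numeric sketch. As a simplification, the sketch merges vectors in list order rather than closest-first; the final vector does not depend on the merge order, because each combination preserves the weighted sum of the original samples. All names and data values are illustrative.

```python
def merge(r1, w1, r2, w2):
    """Weighted-mean combination of two reference vectors, as in Equation (4.2)."""
    w0 = w1 + w2
    return tuple((w1 * a + w2 * b) / w0 for a, b in zip(r1, r2)), w0

def collapse(samples):
    """Merge all samples (initial weight 1 each) down to a single reference vector."""
    refs = [(tuple(s), 1) for s in samples]
    while len(refs) > 1:
        (r1, w1), (r2, w2) = refs.pop(), refs.pop()
        refs.append(merge(r1, w1, r2, w2))
    return refs[0]

samples = [(0.0, 2.0), (1.0, 4.0), (5.0, 0.0), (2.0, 2.0)]
ref, weight = collapse(samples)
mean = tuple(sum(c) / len(samples) for c in zip(*samples))
print(ref, weight, mean)
```

The surviving weight equals the number of training samples, which is why the factor n_i cancels between Equations (4.4) and (4.5).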

Case 2: Minimum Euclidean-Distance Classifier. Given a set of training data for c
classes labeled ω_i, i = 1, ..., c, choose emax = ∞, σ_i = σ, and P(ω_i) = P(ω) = 1/c for
all i. Since all error of the training phase is tolerated, a single reference vector will be
used to represent each category upon completion of training. Case 1 shows that the
resulting discriminant function is given by

$$ g_i(x) = P(\omega)\,\frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left[-\frac{1}{2\sigma^2}(x - \hat{\mu}_i)^{t}(x - \hat{\mu}_i)\right] \tag{4.7} $$

Given the conditions stated above, this classifier behaves exactly as the minimum
Euclidean-distance classifier of Equation (2.22). Furthermore, the decision boundaries are
hyperplanes. The following analysis shows why this is true. First, the natural logarithm
of Equation (4.7) is written as

$$ \tilde{g}_i(x) = -\frac{1}{2\sigma^2}(x - \hat{\mu}_i)^{t}(x - \hat{\mu}_i) + \ln P(\omega) - \frac{d}{2}\ln 2\pi - d\,\ln\sigma. \tag{4.8} $$
Since all of the bias terms are the same in each category, they can be dropped, and
Equation (4.8) can be rewritten as

$$ \tilde{g}_i(x) = -\frac{1}{2\sigma^2}(x - \hat{\mu}_i)^{t}(x - \hat{\mu}_i) $$

which can be scaled by 2σ², giving

$$ \tilde{g}_i(x) = -(x - \hat{\mu}_i)^{t}(x - \hat{\mu}_i). \tag{4.9} $$

Note that the tilde notation indicates that the discriminant function has been changed, but
the classification results have not. Equation (4.9) is the minimum Euclidean-distance
classifier as given by Equation (2.22). As shown by Equation (2.23), this case of the
WPW classifier results in decision boundaries that are hyperplanes.
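
This equivalence is easy to check numerically. The sketch below assumes equal priors and a shared σ, as stated for this case; the class means and test points are made-up values, and the function names are illustrative.

```python
import math

def g(x, mean, sigma, prior):
    """Equation (4.7): Gaussian discriminant with covariance sigma^2 * I."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return prior * math.exp(-sq / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2) ** (d / 2)

def g_tilde(x, mean):
    """Equation (4.9): negative squared Euclidean distance to the class mean."""
    return -sum((a - b) ** 2 for a, b in zip(x, mean))

means = {"w1": (0.0, 0.0), "w2": (3.0, 1.0), "w3": (-1.0, 4.0)}
points = [(0.5, 0.5), (2.0, 2.0), (-1.0, 3.0), (1.4, 0.6)]
for x in points:
    by_g = max(means, key=lambda c: g(x, means[c], 1.5, 1 / 3))
    by_dist = max(means, key=lambda c: g_tilde(x, means[c]))
    assert by_g == by_dist   # both rules pick the nearest class mean
    print(x, by_g)
```

The two rules agree because the logarithm and the positive scaling by 2σ² are monotonic, so they never change which class attains the maximum.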

Case 3: Parzen-Window Classifier. Given a set of training data for c classes labeled
ω_i, i = 1, ..., c, choose emax = 0 and σ as some suitable number. Consider the effect of
emax. Since no error can be tolerated during training, there will be no reduction in the
reference vector set R_i; therefore, it is equal to the training set X_i. Upon the completion of
training for the ith category, n_i reference vectors remain, each of which has a coefficient of
1. Therefore, w_ij = 1 for j = 1, ..., n_i, and ñ_i = n_i. With this in mind, Equation (4.1)
can be rewritten as

$$
\begin{aligned}
g_i(x) &= P(\omega_i)\,\frac{1}{n_i} \sum_{j=1}^{n_i} \frac{1}{(2\pi\sigma_i^2)^{d/2}} \exp\!\left[-\frac{1}{2\sigma_i^2}(x - r_{ij})^{t}(x - r_{ij})\right] \\
       &= P(\omega_i)\,\frac{1}{n_i} \sum_{j=1}^{n_i} \frac{1}{(2\pi\sigma_i^2)^{d/2}} \exp\!\left[-\frac{1}{2\sigma_i^2}(x - x_{ij})^{t}(x - x_{ij})\right]
\end{aligned}
\tag{4.10}
$$

Equation (4.10) is the same as Equation (2.33), which is the Parzen-window discriminant
function. Another important characteristic of this form of the WPW system is that as
n_i → ∞, Equation (4.10) approaches the Bayes-optimal classifier. This is a property of
the Parzen-window approach coupled with a Bayes strategy. (In this case, it has been
assumed that no two reference vectors are identical. If they were identical, they would be
combined without introducing error. Therefore, the WPW classifier would still be
identical to the Parzen-window classifier, but it would be difficult to analyze.)

Case 4: Nearest-Neighbor Classifier. Given a set of training data for c classes
labeled ω_i, i = 1, ..., c, choose emax = 0, σ_i = σ, which is very small, and
P(ω_i) = P(ω) = 1/c for all i. In this case, as in the above case, reference vector
combinations are completely inhibited, and the set of reference vectors R_i is exactly the
training set X_i. The discriminant function for this case is given by

$$ g_i(x) = \frac{1}{c}\,\frac{1}{n_i} \sum_{j=1}^{n_i} \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left[-\frac{1}{2\sigma^2}(x - x_{ij})^{t}(x - x_{ij})\right] \tag{4.11} $$


In this case the classifier may not necessarily become optimal as the number of training
samples approaches infinity. In fact, the best that one may hope to achieve in this case is
known performance. By choosing the value of σ to be very small, the window function
becomes very narrow, and only the nearest neighbors of a test sample affect the
discriminant function values. In this case, Equation (4.11) approaches the
nearest-neighbor (NN) classifier [20:254], but the Parzen-window classifier is still being
employed. This approach to NN pattern classification brings to light a very interesting
situation. Cover and Hart showed that the NN classifier's error rate can be no worse than
twice the Bayes error rate [4] in the large sample case, i.e., as n_i → ∞. But, remember, as
n_i → ∞ the Parzen estimate approaches the true density. So, Equation (4.11) becomes
optimal if and only if the a priori probabilities of each class are each 1/c. According to
Cover and Hart, the optimal error rate can only occur in "the extreme cases of complete
certainty and complete uncertainty" [4:24]. Therefore, Equation (4.11) is optimal for
complete certainty or complete uncertainty; otherwise, it must be strictly greater than the
Bayes rate but less than or equal to twice the Bayes rate. Since Equation (4.10) will
always provide an optimal classifier in the large sample case, Equation (4.11) would only
be used when the number of samples is small and NN classification is desired.
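
The narrowing-window behavior can be illustrated with a small sketch. It evaluates the logarithm of Equation (4.11), dropping the terms shared by all classes, and compares the resulting decision with the 1-NN rule for a small and a large σ. The data are made-up values chosen so that the two window widths disagree, and the function names are illustrative.

```python
import math

def log_g(x, samples, sigma):
    """Log of Equation (4.11) without the terms shared by every class."""
    exps = [-sum((a - b) ** 2 for a, b in zip(x, s)) / (2 * sigma ** 2)
            for s in samples]
    m = max(exps)  # log-sum-exp keeps very narrow windows from underflowing
    return m + math.log(sum(math.exp(e - m) for e in exps)) - math.log(len(samples))

def parzen_label(x, data, sigma):
    """Class with the largest Parzen-window discriminant at x."""
    return max(data, key=lambda c: log_g(x, data[c], sigma))

def nn_label(x, data):
    """Class of the single nearest training sample (1-NN rule)."""
    return min(data, key=lambda c: min(math.dist(x, s) for s in data[c]))

data = {
    "w1": [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)],   # all at distance 1
    "w2": [(0.5, 0.0), (6.0, 6.0), (-6.0, 6.0), (6.0, -6.0)],   # one near, three far
}
x = (0.0, 0.0)
print(nn_label(x, data), parzen_label(x, data, 0.1), parzen_label(x, data, 1.0))
```

With σ = 0.1 the single nearest sample dominates and the decision matches the NN rule; with σ = 1.0 the four moderately close samples of the other class outweigh it.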

The  above  analysis  showed  that  several  well-known  classifiers  could  be  learned  by 
using  Gaussian  window  shapes  and  selecting  the  system  parameters  correctly.  The  first 
two  derivations  rely  on  the  way  reference  vectors  are  combined  during  training  and  the 
window  function  properties.  The  last  two  derivations  rely  on  the  properties  of  the  Parzen 
window  technique  and  the  error  criterion  function,  i.e.,  no  vectors  are  combined.  Since 
the  WPW  algorithm  is  generally  used  to  reduce  the  effective  size  of  the  training  set,  i.e., 
the  storage  requirements,  it  should  be  noted  that  Cases  3  and  4  are  primarily  of  theoretical 
importance  and  are  used  to  bolster  the  credibility  of  the  algorithm.  By  making  use  of  the 
properties  of  the  Gaussian  distribution  and  the  Parzen-window  classifier,  it  has  been 
shown  that  Bayes-Gaussian,  minimum  Euclidean-distance,  Parzen-window,  and  nearest- 
neighbor  classification  can  be  learned  by  the  WPW  algorithm.  This  is  especially  useful 
when  the  WPW  approach  is  thought  of  as  a  black  box.  In  this  case,  known  performance 
classifiers  can  always  be  achieved  with  a  single  black  box  by  simply  tweaking  the  system 
parameters.  In  this  sense,  these  classifiers  can  be  viewed  as  special  cases  of  the  WPW 
algorithm.  The  derivations  above  are  for  extreme  cases  of  training,  i.e.,  when  a  single 
reference  vector  remains  for  each  class  and  when  all  of  the  reference  vectors  remain  after 
training.  Other  cases  are  analytically  intractable,  so  it  becomes  necessary  to  determine 
results  experimentally. 
CHAPTER 5

EXPERIMENTAL  RESULTS 

The  performance  of  the  WPW  classifier  is  described  analytically  in  Chapter  4. 
There,  analyses  are  applied  to  special  case  training  scenarios.  The  capabilities  of  the 
WPW  technique  are  difficult  to  demonstrate  analytically  when  the  number  of  reference 
vectors  is  neither  1  nor  n.  Therefore,  the  experimental  results  are  used  to  demonstrate  the 
effectiveness  of  the  WPW  classifier  for  a  variety  of  system  parameter  choices.  This 
chapter  describes  the  experimental  procedure  used  and  results  obtained.  First,  the  data  is 
discussed.  Then,  experimental  training  results  are  demonstrated  by  graphical  portraits  of 
its  clustering  tendencies.  Classification  results  are  also  presented  in  the  form  of  decision 
boundaries and decision error curves. Specifically, the results of the WPW algorithm are
compared with those of the Bayes-Gaussian, kNN, and Parzen-window classifiers. Finally, to
demonstrate the effects of the two system parameters, a design curve is presented.

The  Data 

The  data  used  to  demonstrate  the  capabilities  of  the  WPW  algorithm  was 
synthesized  to  be  challenging,  but  also  to  allow  the  analytical  determination  of  the 
Bayesian error rate. Two-dimensional data are used so that the properties may be
explored visually; however, the WPW algorithm can be used for data of any dimension. An
independent Gaussian random variable in two dimensions with unit variance was used to
create a two-category data set. The first category is bimodal, while the second is unimodal
and centered between the modes of the first one. The conditional densities are given by
Equation (5.1):

$$ p(x \mid \omega_1) = \frac{1}{2}\left\{ \frac{1}{2\pi}\exp\!\left[-\tfrac{1}{2}(x - \mu_{11})^{t}(x - \mu_{11})\right] + \frac{1}{2\pi}\exp\!\left[-\tfrac{1}{2}(x - \mu_{12})^{t}(x - \mu_{12})\right] \right\} \tag{5.1.a} $$

and

$$ p(x \mid \omega_2) = \frac{1}{2\pi}\exp\!\left[-\tfrac{1}{2}(x - \mu_{2})^{t}(x - \mu_{2})\right] \tag{5.1.b} $$

where

$$ \mu_{11} = [\,0.0 \;\; 0.0\,]^{t}, \qquad \mu_{12} = [\,5.0 \;\; 5.0\,]^{t}, \qquad \mu_{2} = [\,2.5 \;\; 2.5\,]^{t}. $$

Figure 5.1 shows the data, where Class I is represented by squares and Class II is
represented by triangles. The Bayesian error rate, determined analytically, is
approximately 5.0%. In all experiments, the a priori probability was assumed to be equal
for each category, i.e., P(ω_1) = P(ω_2) = 0.5. Figure 5.1 shows a subset of the synthesized
samples.
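
A data set with these densities can be synthesized as sketched below. The sketch draws Class I from an equal mixture of two unit-variance Gaussians and Class II from a single one, following Equation (5.1). The seed, the sample count, and all names are illustrative choices, not values taken from the experiments.

```python
import random

MU_11, MU_12, MU_2 = (0.0, 0.0), (5.0, 5.0), (2.5, 2.5)

def sample_class1(rng):
    """Equal mixture of two unit-variance Gaussians, Equation (5.1.a)."""
    mu = MU_11 if rng.random() < 0.5 else MU_12
    return tuple(rng.gauss(m, 1.0) for m in mu)

def sample_class2(rng):
    """Single unit-variance Gaussian between the two modes, Equation (5.1.b)."""
    return tuple(rng.gauss(m, 1.0) for m in MU_2)

def make_dataset(n_per_class, seed=0):
    """Return a dict mapping class label to a list of 2-D samples."""
    rng = random.Random(seed)
    return {
        "class1": [sample_class1(rng) for _ in range(n_per_class)],
        "class2": [sample_class2(rng) for _ in range(n_per_class)],
    }

data = make_dataset(100)
print(len(data["class1"]), len(data["class2"]))
```

Using a dedicated `random.Random` instance keeps the draw reproducible without touching the global random state.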
Figure 5.1: Two category sample data.




Training  Results 


A data set containing 100 samples in each category is used to show the clustering
tendencies of the training algorithm as the design parameters vary. In all experiments, the
same value of emax and σ is used for each class. The value of σ was selected as outlined
in Table 3.2, Step (1.a). Figure 5.2 shows the average error rate for the Parzen-window
classifier as a function of σ. Based on Figure 5.2, σ was selected as 1.0 for the following
experiments. Figures 5.3-5.6 show the reference vectors after training for emax equal
to 2.5%, 5.0%, 15.0%, and 60.0%, respectively. The vectors are shown with their size
proportional to their corresponding weights w. Table 5.1 lists the number of reference
vectors that resulted after training. It can be seen that whenever the given samples formed
compact clusters, they were collapsed into a single vector. On the other hand, those
samples that were relatively isolated were preserved. Note that in Figure 5.6, the
reference vectors represent the sample mean of each of the modes.
Figure 5.2: Parzen-window classification error vs. smoothing parameter.

Table 5.1: Effect of emax on the number of reference vectors (σ = 1.0).

Figure 5.3: Reference vectors after training (emax = 2.5%, σ = 1.0).

Figure 5.4: Reference vectors after training (emax = 5.0%, σ = 1.0).

Figure 5.5: Reference vectors after training (emax = 15.0%, σ = 1.0).

Figure 5.6: Reference vectors after training (emax = 60.0%, σ = 1.0).

Classification Results


Decision boundary graphics provide an excellent tool for visual analysis of
classifier performance. The WPW algorithm was compared to the Bayes-Gaussian,
Parzen-window, and k-nearest-neighbor classifiers.

Figure 5.7 shows the decision boundaries for the Bayesian classifier. Figures
5.8-5.11 show the decision boundaries that result from the Parzen-window classifier when
the smoothing parameter is 0.1, 0.5, 1.0, and 2.0, respectively. These figures demonstrate
the effect of the smoothing parameter. When σ is small, the decision boundaries are much
like those of the NN classifier. As σ increases, the decision boundaries change from complex
undulating lines to very straight lines. Figures 5.12-5.15 show the decision boundaries for
the 1, 3, 7, and 21 nearest-neighbor rules, respectively. Note that the NN decision rule
results in a decision boundary that is jagged and very specific to the training data.
However, when the 3NN and 7NN rules are used, the decision boundary becomes
smoother and more general. The 21NN rule's decision boundary is smoothest, resulting in
the most general of the nearest-neighbor classifiers shown.
Figure 5.7: Bayesian decision boundaries.

Figure 5.8: Parzen-window decision boundaries (σ = 0.1).

Figure 5.11: Parzen-window decision boundaries (σ = 2.0).

Figure 5.12: Decision boundaries for NN classifier.

Figure 5.15: Decision boundaries for 21NN classifier.
To demonstrate the WPW classifier graphically, decision boundaries were plotted.
The classifier was designed for σ = 1.0 and varying e_max. The WPW decision rule
produced the decision boundaries shown in Figures 5.16-5.19 for e_max = 2.5%, 5.0%,
15.0%, and 60.0%, respectively. The decision boundaries are nearly the same as those
produced by the Parzen-window decision rule shown in Figure 5.10 and similar to the
decision boundaries of the 21NN rule shown in Figure 5.15. The close relationship to the
Parzen-window decision boundaries of Figure 5.10 is expected, since the WPW algorithm
tries to maintain the Parzen estimate while reducing the effective size of the training set.
Note that the decision boundaries are very similar to those of Figure 5.10 even though the
reference vectors represent only 28%, 17.5%, 8.0%, and 1.5% of the original training
samples. This represents a significant storage reduction while maintaining excellent
performance. Note that the decision boundaries of Figure 5.19 are nearly optimal.

The above analysis presents a visual comparison of several classifiers. The Bayes classifier
is optimal but requires a priori knowledge of the data's structure. On the other hand, the
nonparametric classifiers do not require this knowledge but may require excessive storage
and computation time during the classification stage. As the figures show, the
nonparametric classifiers were capable of achieving excellent results. In particular, the
WPW classifier performed as well as the others, with the bonus of requiring fewer
reference vectors and hence less memory and computation time during classification.

To compare the WPW algorithm another way, the total error rate as a function of
the data set size was calculated for the Parzen-window, k-nearest-neighbor, and WPW
classifiers. In this case, the number of training samples was as large as the number of test
samples. Figures 5.20, 5.21, and 5.22 show the error rates. The error shown was
calculated by training and then testing with different but equal-size sets. The curves show
that the WPW's performance is excellent when compared to the others.
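
The equal-size train/test protocol described above can be sketched in a few lines. This is a minimal illustration and not the code used for Figures 5.20-5.22: it assumes a Gaussian window, equal class priors, and two synthetic Gaussian classes, and the names parzen_density, classify, holdout_error, and make_class are hypothetical.

```python
import math
import random


def parzen_density(x, samples, sigma):
    """Parzen estimate at x with an isotropic Gaussian window of width sigma."""
    d = len(x)
    norm = (2.0 * math.pi * sigma ** 2) ** (d / 2.0)
    total = 0.0
    for s in samples:
        sq = sum((a - b) ** 2 for a, b in zip(x, s))
        total += math.exp(-sq / (2.0 * sigma ** 2))
    return total / (len(samples) * norm)


def classify(x, class_samples, sigma):
    """Pick the class with the largest estimated density (equal priors assumed)."""
    return max(class_samples,
               key=lambda c: parzen_density(x, class_samples[c], sigma))


def holdout_error(train, test, sigma):
    """Fraction of test points misclassified after training on a disjoint set."""
    wrong = 0
    count = 0
    for label, points in test.items():
        for x in points:
            wrong += classify(x, train, sigma) != label
            count += 1
    return wrong / count


def make_class(rng, mean, n):
    """Draw n two-dimensional points from a unit-variance Gaussian (synthetic)."""
    return [(rng.gauss(mean[0], 1.0), rng.gauss(mean[1], 1.0)) for _ in range(n)]


rng = random.Random(0)
means = {"A": (0.0, 0.0), "B": (4.0, 4.0)}   # assumed, well-separated classes
train = {c: make_class(rng, m, 50) for c, m in means.items()}
test = {c: make_class(rng, m, 50) for c, m in means.items()}  # same size as train
error = holdout_error(train, test, sigma=1.0)
print(f"holdout error: {error:.3f}")
```

Repeating the last four lines for several set sizes and values of sigma would trace out an error-versus-size curve of the kind the figures report.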

Figure 5.16: Weighted-Parzen-window decision boundaries (e_max = 2.5%, σ = 1.0).
Figure 5.17: Weighted-Parzen-window decision boundaries (e_max = 5.0%, σ = 1.0).
Figure 5.18: Weighted-Parzen-window decision boundaries (e_max = 15.0%, σ = 1.0).
Figure 5.19: Weighted-Parzen-window decision boundaries (e_max = 60.0%, σ = 1.0).
Figure 5.20: Parzen-window classification error for various smoothing parameters.
Figure 5.21: Classification error for k-nearest-neighbor.

Parameter Design Curve


The selection of the parameters e_max and σ has a direct impact on the
performance of the system. The effect of their different values is summarized in the curves
given by Figure 5.23. These curves were constructed by training the system with 100
samples of each class for different values of e_max and σ. For each of these training
phases, the average error rate and the average total number of reference vectors, n_total,
were obtained by using the leave-one-out technique [5:76]. Thus, a point in a curve of
Figure 5.23 corresponds to a pair of values (e_max, σ), and it is given by the corresponding
error rate in the classification stage and n_total. For the data used in this experiment, it was
observed that as the allowed error e_max increased, the number of reference vectors decreased
and the classification error increased. Therefore, in this case, there is a trade-off between
the classification error rate and the total number of reference vectors. However, given a
desired level of performance (in terms of the desired classification rate and time and
memory requirements), the curves in Figure 5.23 provide the required values of the
parameters e_max and σ. This experiment required significant computing resources.
However, if they are available, the WPW system can easily be designed by creating curves
such as those in Figure 5.23 for many values of σ. Once the curves are plotted, the
designer simply chooses the parameters that provide the lowest error rate with the
fewest reference vectors. Such a design scheme can be implemented by the
algorithm shown in Table 5.2. Note that WPW classifier design is deterministic if
sufficient computing resources are available, i.e., system parameters can be found that
yield the desired performance.

Figure 5.23: Parameter design curve.

Table 5.2: Deterministic weighted-Parzen-window classifier design.

Step 1. Initialize σ = [σ_1, ..., σ_j]^T, where j can be any reasonable positive integer,
        initialize k = 1, where 1 ≤ k ≤ j, and choose Δ very small. (The smoothing
        parameter and e_max are assumed the same for each class to simplify this procedure.
        See reference [20:254] for selection of the smoothing parameter.)

Step 2. Initialize e_max = 0.0 and select σ_k.

Step 3. Use the parameters (e_max, σ_k) for WPW training.

Step 4. Estimate and store the total error rate and the total number of reference vectors
        for all classes, n_total.

        a. If the training set is small, use the leave-one-out method when estimating
           the error rate [5:76] and find the average number of reference vectors.

        b. If the training set is large, partition the data into two disjoint sets for
           training and testing to estimate the error rate [5:76] and find n_total.

Step 5. IF (n_i > 1 for any i) THEN
            e_max ← e_max + Δ
            go to Step 3
        ELSEIF (n_i = 1 for all i) THEN
            k ← k + 1
            go to Step 2
        ELSE (k > j) THEN
            stop
        END.

Step 6. Plot classification error vs. n_total and choose the parameters (σ_k, e_max) that
        meet or exceed design specifications for storage and classification error.
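
The design loop of Table 5.2 can be sketched as follows. This is only an illustration under stated assumptions: WPW training itself is not implemented here, so the sweep takes a hypothetical callable train_and_score(e_max, sigma) that is assumed to return the estimated error rate and the per-class reference-vector counts. The toy stand-in, the max_steps guard, and the tie-breaking rule in choose_parameters are likewise assumptions, not part of the table.

```python
def design_sweep(train_and_score, sigmas, delta, max_steps=1000):
    """Sweep (e_max, sigma) and collect design-curve points.

    Each point is (sigma, e_max, error_rate, n_total).
    """
    points = []
    for sigma in sigmas:                       # one curve per smoothing parameter
        e_max = 0.0
        for _ in range(max_steps):             # safety guard, not in Table 5.2
            error_rate, counts = train_and_score(e_max, sigma)
            points.append((sigma, e_max, error_rate, sum(counts)))
            if all(n == 1 for n in counts):    # every class reduced to one vector
                break
            e_max += delta                     # allow a little more error
    return points


def choose_parameters(points, max_error, max_vectors):
    """Among points meeting both specifications, prefer the fewest vectors."""
    feasible = [p for p in points if p[2] <= max_error and p[3] <= max_vectors]
    if not feasible:
        return None
    return min(feasible, key=lambda p: (p[3], p[2]))


def toy_train_and_score(e_max, sigma):
    """Placeholder for WPW training: made-up behaviour, not measured results."""
    n = max(1, round(20 * (1.0 - e_max)))      # fewer vectors as e_max grows
    return 0.05 + 0.2 * e_max, [n, n]          # error rises as e_max grows


points = design_sweep(toy_train_and_score, sigmas=[0.5, 1.0], delta=0.25)
best = choose_parameters(points, max_error=0.16, max_vectors=25)
print(best)
```

In practice train_and_score would wrap WPW training together with the leave-one-out or disjoint-set error estimate, which is where nearly all of the computing cost lies.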


CHAPTER  6 
CONCLUSION 


In  this  chapter,  a  summary  of  the  thesis  is  presented.  Then,  future  research  goals 
are  discussed.  Finally,  key  research  results  are  highlighted. 

Summary 

This thesis presents a novel pattern recognition approach named Weighted Parzen
Windows (WPW). In Chapter 2, several pattern recognition concepts and techniques are
discussed. They are fundamental to the formulation and analysis of the WPW technique.
Techniques discussed in Chapter 2 are the Bayes, minimum Euclidean-distance,
Parzen-window, and nearest-neighbor classifiers. It is shown that the Bayes-Gaussian
classifier is quadratic in the general case and linear when the covariance matrices are equal
for each class. Two minimum-distance classifiers are presented in Chapter 2. It is shown that
the Mahalanobis-distance classifier is quadratic and the Euclidean-distance classifier is linear.
These classifiers are the first of the nonparametric techniques reviewed, i.e., the techniques
that do not require a priori knowledge of the density. The next technique is the
Parzen-window density estimate, which is used to estimate a density function. The
Parzen-window classifier is asymptotically optimal when used in a Bayes strategy. The
final technique presented in Chapter 2 is the k-nearest-neighbor (kNN) classifier. In
Chapter 3, the weighted-Parzen-window (WPW) technique is presented. This technique uses a
nonparametric supervised learning algorithm to estimate the underlying density function
for each set of training data. The WPW training algorithm quantizes vector space based
on a probability space error criterion. It is a second-order approximation since it is an
estimate of the Parzen estimate. Classification is accomplished by using the WPW density
estimate in a Bayes strategy. In Chapter 3, it is proven that the WPW training algorithm is
stepwise optimal. Also, it is shown that the number of distance calculations required is
O(n^3) for training and O(n) for classification. Also, an algorithm for WPW classifier
design is presented. In Chapter 4, it is shown that several well-known classifiers can be
learned by using Gaussian window shapes and selecting the system parameters correctly.
By making use of the training algorithm properties, it is shown that Bayes-Gaussian,
minimum Euclidean-distance, Parzen-window, and nearest-neighbor classification can be
learned by the WPW algorithm. Analytical results are given for the cases when a single
reference vector remains after training or when all reference vectors remain after training.
Other cases are analytically intractable, so it was necessary to determine results
experimentally. In Chapter 5, experimental results are reported to demonstrate the
performance of the WPW algorithm as compared to traditional classifiers.
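
For reference, the two density estimates contrasted in this summary can be written compactly. The notation below is assumed for illustration and may differ from the symbols used in Chapters 2 and 3: n training samples x_i, m ≤ n reference vectors r_j with weights w_j, and a window function φ with smoothing parameter σ.

```latex
\hat{p}_{\mathrm{PW}}(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \varphi(\mathbf{x}-\mathbf{x}_i;\sigma),
\qquad
\hat{p}_{\mathrm{WPW}}(\mathbf{x}) = \frac{1}{n}\sum_{j=1}^{m} w_j\,\varphi(\mathbf{x}-\mathbf{r}_j;\sigma),
\qquad \sum_{j=1}^{m} w_j = n .
```

Under this reading, keeping every training sample as a reference vector (m = n, all w_j = 1) recovers the Parzen-window estimate, which corresponds to one of the two analytically tractable cases mentioned above.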

Future  Research  Efforts 

This  research  has  focused  on  the  development  and  analysis  of  the  WPW  approach 
as  a  statistical  classifier.  Therefore,  statistical  pattern  recognition  techniques  and 
philosophies  have  been  paramount  to  the  development  and  analysis  of  the  novel  approach. 
However,  statistical  pattern  recognition  is  only  one  possible  use  of  the  WPW  technique.  It 
has  been  designed  with  other  uses  in  mind.  Other  applications  of  the  WPW  include 
Artificial  Neural  Networks  (ANNs)  and  Vector  Quantization  (VQ).  Future  research 
efforts  will  be  directed  towards  formalizing  doctrine  for  these  uses.  Also,  future  research 
will  explore  WPW  training  algorithm  refinement. 

The WPW as an Artificial Neural Network. The WPW technique can be used in
many  neural  applications.  The  WPW  structure,  training  algorithm,  and  performance  make 
it  very  appealing  for  Artificial  Neural  Network  (ANN)  applications. 

The structure of the WPW classifier, in general, is similar to Radial Basis Function
(RBF) networks [3], [16], [17]. The WPW classifier with Gaussian windows is an RBF
network as defined by current literature. However, the RBF training paradigms are
theoretically and fundamentally different from the WPW. The training algorithm of the WPW
is designed to estimate the density of each class. Therefore, a Bayes approach can always be
employed. In general, RBFs do not estimate class densities. The Probabilistic Neural
Network (PNN) [24] does try to estimate class densities; however, it makes no provisions
for reducing the storage requirements.

Many different learning paradigms exist for ANN training. Training usually
involves iteratively updating synaptic weights. One such paradigm is the
error-back-propagation (EBP) technique of D. E. Rumelhart et al. [21]. Until recently, the
structure of many ANNs was designed using ad hoc techniques [2]. However, N. K. Bose et al.
have presented a technique by which the structure and weights are trained simultaneously
[2]. In this technique, Voronoi diagrams (VoDs) are used to design the weights and
structure of a feed-forward network. This classifier is trained without the use of
parameters. The WPW classifier also trains both the weights and the structure of a
feed-forward network. In the WPW technique, the number of reference vectors, i.e., the
structure, and the values of the weights are determined by the selection of two parameters.

Self-organizing maps (SOMs) are popular among neural techniques because of
their self-organizing properties. However, they are not generally used directly for pattern
classification. Recently, J. A. Kangas et al. reported that SOMs can be used as a
preprocessing step for ANNs [13]. Specifically, SOMs can be used to generate
codebooks for Learning Vector Quantization (LVQ) techniques. The WPW algorithm can
also be used in this capacity, i.e., it can preprocess the neurons.

The WPW technique can be used for many neural applications; therefore, future
research will emphasize these applications. In particular, the neural structure for training
will be developed. Also, comparisons will be made with various ANNs, specifically
the following techniques: VoD-based classifier [2], EBP
[21], LVQ [14], [15], LVQ2 [14], [15], and the PNN [24]. Comparisons will be made
with the structure, philosophy, and performance of each of these techniques.

The  WPW  as  a  Vector  Quantizer.  Vector  Quantization  (VQ)  is  used  for  speech 
and  image  data  compression.  Data  are  compressed  and  transmitted  serially  to  a  receiver 
where  they  are  decompressed.  Compression  is  accomplished  by  using  a  codebook  to  map 
the  transmission  data  into  one  of  the  codebook  vectors.  This  mapping  quantizes  the 
transmission  data.  Reference  [1]  is  an  excellent  source  for  further  details  if  necessary. 
Generation  of  a  good  codebook  is  the  main  problem  in  VQ.  The  WPW  training  algorithm 
is  directly  applicable  to  codebook  generation.  Image  codebooks  can  be  generated  by  the 
WPW  algorithm  for  gray-level  and  color  images.  Future  research  will  be  directed  towards 
mapping  the  WPW  training  algorithm  into  a  parallel  structure.  Also,  experiments  will  be 
run  to  test  the  effectiveness  of  the  WPW  algorithm  as  compared  to  traditional  codebook 
generation  techniques.  (Note,  in  codebook  generation,  the  required  number  of 
calculations  can  be  severe  with  the  WPW  technique.  Calculation  reduction  refinements 
are  discussed  in  the  following  section.) 
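
The codebook mapping described above can be illustrated with a short sketch. It assumes squared Euclidean distance as the distortion measure and uses a hypothetical three-entry codebook; how the codebook is generated (by WPW training or otherwise) is outside the scope of this example.

```python
def nearest_index(x, codebook):
    """Index of the codebook vector closest to x in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def encode(data, codebook):
    """Transmitter side: replace each vector by the index of its nearest codeword."""
    return [nearest_index(x, codebook) for x in data]


def decode(indices, codebook):
    """Receiver side: look each index up in the shared codebook."""
    return [codebook[i] for i in indices]


codebook = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]   # hypothetical codebook
data = [(1.0, -1.0), (9.0, 11.0), (2.0, 8.0)]
indices = encode(data, codebook)
reconstructed = decode(indices, codebook)
print(indices)          # prints [0, 1, 2]
```

Only the indices need to be transmitted, so the compression ratio is set by the codebook size, while the reconstruction error is set by how well the codebook covers the data.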

WPW  Refinements.  Research  efforts  to  this  point  have  been  focused  on  the 
introduction  of  the  WPW  technique.  Mainly,  research  has  been  concerned  with  the 
fundamental  theory  of  the  technique  and  its  validity.  Future  research,  however,  will 
concentrate  on  refinements  that  will  ensure  efficient  practical  application.  This  means 
reducing the number of calculations required during training and developing the
adaptability  of  the  classifier. 

The training algorithm was shown to be O(n^3); therefore, an obvious refinement is
to decrease the serial calculations necessary. Possible refinements can be made when
finding the two closest vectors in Step 4 of the training algorithm, e.g., preprocessing the
reference vectors and recursively updating them on every step. Several quick search
routines are available to programmers [6], [9], [23], [26]. It may be possible to modify
such a routine for the WPW training algorithm. Also, it may be possible to store the
WPW density estimate in a table and update it recursively on each training step to
reduce the calculations necessary in Step 7 of the training algorithm. Parallel computation
approaches will be explored. The classifier is highly parallel in structure; therefore, future
research will involve mapping the classifier into a parallel architecture. With a parallel
approach, WPW training is nearly trivial since all distance calculations can be made
simultaneously.
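
To make the cost of the closest-pair search concrete, the sketch below shows a brute-force search followed by one merge. It is an illustration only: the merge rule (replace the pair by its weighted mean and sum the weights) is an assumption made here, and the probability-space error check of the training algorithm is omitted. The names closest_pair and merge_closest are hypothetical.

```python
def closest_pair(vectors):
    """Brute-force search over all pairs; returns the indices (i, j), i < j."""
    best = None
    best_d = float("inf")
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            d = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            if d < best_d:
                best_d, best = d, (i, j)
    return best


def merge_closest(vectors, weights):
    """Replace the closest pair by its weighted mean (assumed merge rule)."""
    i, j = closest_pair(vectors)
    w = weights[i] + weights[j]
    merged = tuple((weights[i] * a + weights[j] * b) / w
                   for a, b in zip(vectors[i], vectors[j]))
    keep = [k for k in range(len(vectors)) if k not in (i, j)]
    return [vectors[k] for k in keep] + [merged], [weights[k] for k in keep] + [w]


vectors = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
weights = [1, 1, 1]
new_vectors, new_weights = merge_closest(vectors, weights)
print(new_vectors, new_weights)
```

Each search over m vectors costs on the order of m^2 distance calculations, and up to n - 1 merges may be attempted, which is consistent with the cubic bound quoted above, although the derivation in Chapter 3 may count operations differently. Caching pairwise distances between steps is one way such a search could be made cheaper.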

The classifier structure must be enhanced by providing a technique to incorporate
new exemplars after training, i.e., a mechanism for updating. Future research will
focus on a technique that simply includes new exemplars with a single window function.
Tentatively, this approach will store all new exemplars until the storage capabilities can no
longer accommodate them. At that point, the training algorithm can be invoked to
combine vectors where possible. Again, this is a tentative plan which must be further
developed.

Conclusions


In this thesis, the WPW classifier was introduced. The WPW classifier is a
nonparametric supervised learning technique for pattern recognition. This technique
assumes nothing about the underlying density function but approximates it.
Since the approach of the WPW is to estimate the density of each class, training
can be conducted separately for each class, which means that training can be conducted in
parallel. This is an important advantage for parallel processing and hardware implementation.
Density estimation is an important concept because the classifier is designed to approximate
the minimum-risk Bayes classifier. This is the fundamental underlying philosophy of the
approach. However, it is often difficult to store all of the training samples available for the
Bayesian nonparametric approach. This is why the WPW technique has been developed.
The WPW technique of training set reduction is unique in that it employs two distortion
measures. The first is a vector space distortion measure; the second is a probability space
distortion measure. This approach quantizes vector space with respect to probability
space. The author is unaware of any other techniques of this type. Furthermore, the WPW
technique was developed and analyzed with traditional statistical techniques. In
Chapter 3, it is proven that the WPW training algorithm is stepwise optimal. Also, the
computational complexity is discussed. In Chapter 4, the WPW is shown to be
functionally equivalent to several well-known classifiers. Chapter 5 gives experimental
results.

In Chapter 3, two important results were developed: stepwise optimality and
computational complexity. The training algorithm of the WPW is stepwise optimal. This
means that each training step introduces the smallest quantization error possible. Also, in
Chapter 3, computational complexity is derived. It is shown that the serial distance
calculations necessary are O(n^3) for training and O(n) for classification. Although the
training algorithm is O(n^3), the classification algorithm is very fast, O(n). Furthermore,
the training algorithm will converge to a solution after a finite number of training steps.
This is an important result when considering that some neural techniques may never
converge to a solution.

Chapter 4 presents several important analytical results. In this chapter, it is shown
that Bayes-Gaussian, minimum Euclidean-distance, Parzen-window, and nearest-neighbor
classification can be learned by the WPW algorithm. This is especially useful when the
WPW approach is thought of as a black box. In this case, classifiers of known performance
can always be achieved with a single black box by simply adjusting the system parameters.
In this sense, the above traditional classifiers can be viewed as special cases of the WPW
algorithm. Chapter 4 also shows that Bayes performance can be achieved in certain
cases. This is of fundamental importance because the Bayes error rate is the theoretical
bound.

In Chapter 5, experimental results show that the WPW classifier is comparable, if
not superior, to some traditional techniques. Decision boundary graphics and error curves
were used to show that the WPW approach reduces the effective size of the training data
without introducing significant classification error.

The  WPW  technique  is  a  very  powerful  statistical  pattern  recognition  approach. 
Excellent  performance  was  demonstrated  theoretically  and  experimentally.  The  WPW 
technique  encompasses  many  important  concepts  from  statistical  pattern  recognition. 
However,  the  use  of  the  WPW  technique  should  not  be  limited  to  statistical  pattern 
recognition.  The  WPW  can  be  used  in  neural  applications  as  well  as  codebook  generation 
for  vector  quantization.  These  applications  are  the  subject  of  current  and  future  research. 